EDITORIAL


Inaugural Editorial for the Journal of Educational Psychology 


My editorship of the Journal of Educational Psychology (JEP) begins with this issue. I am excited and 
pleased to start this journey, as I believe that JEP is without peer in the educational research world. It publishes 
the most important, highest quality research in educational psychology and education more broadly. As I begin 
this venture, I am acutely aware that the job of an editor is not an easy one. As the poet John Wheelock noted, 
it is the “dullest, hardest, most exciting, exasperating and rewarding of perhaps any job in the world” 
(Charlton, 1997, p. 142). Editors play a pivotal role in the research enterprise, vetting, shaping, and improving 
the presentation of the work submitted to them, but it is an arduous process that involves peer review, 
agonizing decisions about what to publish, and painstaking attention to detail. 

Fortunately, I am not making this journey alone. I am joined by an outstanding group of associate editors 
that includes (in alphabetical order) Jill Fitzgerald, Pani Kendeou, Pui-Wa Lei, Dan Robinson, Cary Roseth, 
Tanya Santangelo, Gregg Schraw, Birgit Spinath, and Young-Suk Kim. We are joined by a highly talented, 
diverse, and international board of consulting editors and principal reviewers. Principal reviewers are a new 
addition to JEP. They make a welcomed commitment to serve the journal by reviewing between four and six 
manuscripts a year. Their efforts are recognized in the last issue of each volume year. 

One of my favorite comments about editors comes from Samuel Clemens, who quipped, “How often we 
recall, with regret, that Napoleon once shot at a magazine editor and missed him and killed a publisher. But 
we remember with charity that his intentions were good” (Ayres, 1997, p. 66). At one time or another, I 
suspect all of us have harbored a negative thought or two about editors. Although there are many possible 
reasons for this, one of the most exasperating involves an inordinately long review process. We on the JEP 
editorial team are committed to making sure this does not happen here. Our goal is for authors to receive a 
decision on their manuscript, based on a sound evaluation by their peers, in a timely manner—in 90 days or 
less, with an emphasis on less. If a paper is clearly not appropriate for JEP, we will let authors know why 
immediately. 

As the new editor of JEP, I am acutely aware of my responsibilities to the journal, educational psychology, 
and the field of education in general. My predecessors comprise a formidable array of talent and editorial 
wizardry, including Raymond Kuhlen, Wayne Holtzman, Johanna Williams, Samuel Ball, Robert Calfee, Joel 
Levin, Michael Pressley, Karen Harris, and Art Graesser. To quote a former editor, they made JEP the leading 
“outlet in the world for psychologically oriented research in education” (Pressley, 1997, p. 3). They 
accomplished this by publishing high-quality investigations, articles that moved the field forward conceptually 
and empirically, and the strongest interdisciplinary research in education. They encouraged others to submit 
their best work to the journal and made hard decisions about what to publish. We will uphold these traditions. 

This does not mean that maintaining the status quo is our objective. We plan to make JEP even better. How 
do we plan to achieve this goal? At the most basic level, we want to make sure the work submitted and 
published in the journal is as good as it can be. This means that before a study is reviewed for JEP, it must 
meet certain criteria. Before submitting an article to JEP, we encourage authors to examine the Journal Article 
Reporting Standards specified in Volume 63 of the 2008 American Psychologist (APA Publications and 
Communications Board Working Group on Journal Article Reporting Standards, 2008) or the appendix in the 
Publication Manual of the American Psychological Association (American Psychological Association, 2010). 

Some criteria must be met before we send a paper out for review. First, the participants and the setting in 
which the research occurred must be adequately described. Such descriptions are essential to contextualizing 
and interpreting the findings from a study, replicating an investigation, determining generalizability of 
findings, and conducting meta-analyses. 

Second, there must be adequate evidence that measures are reliable and valid. Because reliability is a 
characteristic of the sample and the measure (Crocker & Algina, 1986; Harris, 2003), referencing previous 
evidence to support reliability is not enough in many cases. In these instances, authors must provide evidence 
that the measures used in their study were reliable with their sample. 

A sometimes vexing set of measures in terms of reliability and validity are grades and grade point averages. 
In some instances, grades are based on one or more reliable exams administered to all participants, but too 
often they are based on undefined and likely different procedures that vary by the class or classes students 
completed. Although grades and grade point averages are a legitimate area of study in their own right, we are 
reluctant to review a paper when the reliability and validity of grades are questionable, especially if they serve 
as the only outcome measures of achievement or academic progress in a study. Before such a study is 
reviewed, it is incumbent on the authors to provide convincing evidence that these measures are, in fact, 
adequate. 

Third, we would like for JEP to publish even more intervention research than it does now, but before we 
review an intervention study, authors must provide evidence that the treatment was implemented as intended. 
Simply put, trust cannot be placed in findings from a study in which treatment validity was not established. 
Such evidence is essential to any claim that a specific intervention was responsible for observed changes. Of 
course, it is equally important to describe what happened in control and comparison conditions. 

Fourth, authors may be asked to upgrade their statistical analyses before we review a paper. We hope this 
will not occur often, but we are especially sensitive to this issue for large as well as longitudinal databases. 
The process of raising questions and concerns about data analyses commonly occurs during the peer review 
process, but if there is an evident issue, we will ask for it to be resolved before the paper is reviewed. 

As noted earlier, one of our intents in implementing these criteria is to shape and enhance the quality of 
work submitted and published in JEP. Two other important purposes are served as well. First, authors increase 
the probability of receiving a positive review when participants and setting are adequately described, measures 
are reliable and valid, treatment fidelity is established, and appropriate statistical procedures are applied. 
Second, reviewers do not spend valuable time reviewing a manuscript missing fundamental information. 

We further require authors submitting manuscripts to JEP to report appropriate effect sizes as well as 
means, standard deviations, and confidence intervals for their variables. More than 10 years ago, Harris (2003) 
made a similar call, but we still receive a sizable number of submissions missing such basic data. 

Another way we plan to make JEP even better is to communicate to the field our interest in publishing 
high-quality research involving multiple methodologies, including quantitative, qualitative, single-subject, and 
mixed-methods designs. This interest extends to other forms of scholarship, especially meta-analyses, but 
includes conceptual, methodological, and integrative reviews of the literature too. The world of educational 
psychology is very diverse in its interests and approaches to scholarship. We hope that during our watch, JEP 
can become even better at capturing this complexity. 

For an editor, it is a bit risky to specify the types of papers you do not plan to publish, as exceptions may 
be made along the way. With that said, we do not plan to publish survey studies that are based on convenience 
or unrepresentative samples of respondents. Nor do we plan to publish studies that primarily focus on creating 
and validating a test or specific measures. Although such studies are important and needed, a variety of 
journals serve this purpose. Finally, replication studies are important to science and the field. Replications of 
new findings contained in a single article are encouraged. The journal might also look favorably on multiple 
replications of a prior study in a submitted paper. For the most part, though, a new study needs to do more 
than simply replicate a previously published investigation. It needs to make an important extension to 
understanding of the phenomena under investigation. Thus, systematic replications that both reproduce and 
extend the original research are encouraged. 

As I bring this editorial to a close, I want to indicate how pleased I am that JEP has become so international. 
This has become increasingly evident over the last 20 years in terms of editorial board members, submissions, 
and publications. This is a trend my team and I plan to nurture. We further invite academics who are interested 
in reviewing for the journal to contact me (steve.graham@asu.edu): Send your vita and tell me about your 
expertise. In the best interest of student training, we encourage reviewers to invite doctoral students to 
complete reviews with them. We acknowledge these students’ contribution in the final issue of each volume. 
Such apprenticeships are essential to growing a healthy field. 

In closing, I have one last thought to share with you. If you send us a paper and we publish it, please accept 
our thanks and send us more papers! If we do not publish it, the same sentiment applies. 


Steve Graham, Editor. 
Editorial 


I accepted the editorship of Journal of Educational Psychology in 2008 with the hopes of achieving several 
goals. Concrete missions motivate an editor to devote the necessary time and energy to editing over a 
nontrivial span of one’s career. Now that I have completed my 6-year term as editor, I can reflect on the extent 
to which my goals were achieved. 

The first, rather obvious, goal was to have the journal grow with an abundance of high-quality research. The 
number of submissions indeed grew by approximately 50% to 550 new submissions per year and the pages 
increased by 33% to 1,200 printed pages per volume. During the same period, the rejection rate increased to 
83%. Thanks to the 11 associate editors, over 100 colleagues on the editorial board, and thousands of ad hoc 
reviewers, the quality of the reviews maintained the historically high standards of this journal. Of course, these 
numbers are reassuring, but what about the more substantive goals? 

The second goal was to encourage studies with objective measures of learning, achievement, social 
interaction, motivation, emotion, and other psychological constructs relevant to education. Objective measures 
are grounded in behavior, performance, cognitive tasks, objective tests, and neuroscience. Psychological rating 
scales and other forms of self-report are important sources of data when combined with objective measures, 
but exclusive reliance on self-report data is a flimsy foundation for science in the 21st century. The associate 
editors shared this perspective and applied rigorous measurement standards in the review process. Our 
perspective did disappoint some authors who had built a cottage industry of administering psychological tests 
with self-report measures to samples of participants who were available (convenience samples rather than 
representative samples) and reporting correlations among these self-report measures (typically without 
replication). However, serious complaints about our position were surprisingly rare and hopefully reflected 
general improvements in measurement standards in the field of educational psychology. 

A third goal was to increase coverage of research with computer technologies. Educational technologies 
have had a revolutionary effect on education during the last decade; therefore, it is important for the journal 
to capture those trends. Computers can reliably collect objective and self-report measures at a fine-grain level, 
systematically implement pedagogical interventions, and quickly analyze data. The journal did have an 
increase in publications with educational technologies over the course of my editorship. This was partly 
reflected in a special issue on advanced learning technologies (the only special issue under my editorship). The 
hope is that this journal continues to encourage submissions with computer technologies, such as multimedia, 
intelligent tutoring systems, conversational agents, social media, educational games, distributed learning 
environments, and conventional computer-based training. 

A fourth goal was to encourage studies that coordinate educational data mining methodologies with 
theory-based evidence-centered design and measures that satisfy psychometric standards. For example, there 
was a special section of an issue that focused on analyzing computer logs in large-scale assessments. Hundreds 
of observations per hour can be tracked by computers, including response times. Such rich data can be mined 
to discover patterns of data that might not be anticipated by researchers a priori, but they can be linked to 
psychological constructs and thereby advance educational theory. Work on this fourth goal did not progress 
to my satisfaction, but it will hopefully evolve in future years. 

The most difficult challenge as editor was in finding ways to handle a large number of manuscripts with 
sophisticated quantitative techniques that stretched beyond the conventional analysis of variance, multiple 
regression, and nonparametric statistical analyses. There were not enough colleagues with expertise in 
advanced statistics to handle the load. Indeed, universities are not awarding a sufficient number of doctoral 
degrees in quantitative areas of the social sciences to meet the demands of several areas of psychology. The 
challenge was compounded by the fact that many of the manuscripts with advanced statistical techniques were 
quite lengthy; therefore, colleagues were prone to decline reviewing them (even after they initially agreed to 
review the manuscripts). Consequently, some manuscripts required more than 3 months for review and there 
was high turnover in associate editors with sophisticated quantitative expertise. We need to find ways to fill 
the serious expertise gap in advanced research designs and statistics. 

In closing, I would like to thank Jean Edgar, my chief editorial assistant. She assisted me with compassion 
and enthusiasm for over 6 years. 

Art Graesser, Editor 
[CCSSO], 2010) bring unprecedented attention to the nature of 
texts that students read. The goal of the Standards is for high 
school graduates to be well prepared for college and workplace 
careers. The ability to read college-and-workplace texts plays a 
prominent role in the Standards for that preparation. Citing prior 
evidence of a current-day gap between the text-complexity levels 
at high-school graduation and college and workplace (e.g., ACT, 
2006; Williamson, 2008), the CCSS authors set a challenging 
standard for all students to be able to “comprehend texts of 
steadily increasing complexity as they progress through school 
...” (NGA & CCSSO, 2010, Appendix A, p. 2). The foundation 
for students’ ability to read increasingly complex texts begins in 
early reading exposure, and considerable controversy and debate 
have focused attention on the potential impact of the text-complexity 
Standard for young readers (e.g., Hiebert, 2012; Mesmer, 
Cunningham, & Hiebert, 2012). As educators attempt to 
support youngsters to read increasingly complex texts, early- 
grades teachers need a sound understanding of what makes texts 
more or less complex for young students who are beginning to 
learn to read. An empirically based understanding of text com- 
plexity for early-grades readers is critical for practical reasons and 
should also contribute to development of theoretical modeling of 
text complexity. The purpose of the present study was to explore 
text characteristics specifically in relation to early-grades text 
complexity. The research questions addressed in the study were as 
follows: (a) Which text characteristics are most important for 
early-grades text complexity? (b) Is there interplay of text charac- 
teristics in relation to text complexity, and if there is, can any 
aspects of the interplay be described? The research questions were 
addressed using computer-based analysis of texts. The present 
study makes an additional contribution to the educational research 
literature in that a statistical approach and methodological se- 
quence unique in the educational research literature were used— 
random forest regression in conjunction with a machine-learning 
research paradigm. 
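
To give readers unfamiliar with the approach a rough sense of what random forest regression does, the following is a minimal sketch and not the analysis used in the present study. It assumes a deliberately simplified forest in which each tree is a single-split "stump" fit to a bootstrap sample using a random subset of features, and predictions are averaged across trees. The two features (mean word length and mean sentence length), the complexity scores, and all function names are hypothetical and invented for illustration.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_ids):
    """Find the single (feature, cut) split with the lowest squared error.

    Only the features listed in feature_ids are considered.
    Returns (feature index, cut point, left mean, right mean).
    """
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= cut]
            right = [t for row, t in zip(rows, targets) if row[j] > cut]
            m_left, m_right = mean(left), mean(right)
            sse = (sum((t - m_left) ** 2 for t in left)
                   + sum((t - m_right) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, cut, m_left, m_right)
    if best is None:
        # No usable split in this sample: always predict the overall mean.
        overall = mean(targets)
        return (0, float("inf"), overall, overall)
    return best[1:]


def fit_forest(rows, targets, n_trees=50, n_features_per_tree=1, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]         # sample texts with replacement
        feats = rng.sample(range(p), n_features_per_tree)  # random subset of features
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps for one text."""
    return mean(left if row[j] <= cut else right
                for j, cut, left, right in forest)


# Hypothetical data: [mean word length, mean sentence length] for eight texts,
# with invented complexity scores that rise with mean word length.
rows = [[3.0, 5], [3.2, 9], [3.5, 6], [4.0, 8],
        [4.6, 5], [5.0, 9], [5.4, 6], [5.8, 8]]
targets = [10, 12, 15, 30, 55, 60, 70, 75]

forest = fit_forest(rows, targets)
easy_text = predict(forest, [3.1, 7])
hard_text = predict(forest, [5.5, 7])
print(easy_text, hard_text)
```

A full random forest grows deep trees and resamples the candidate features at every split. The sketch keeps only the two ingredients that distinguish the method from a single regression tree: bootstrap resampling of cases and random restriction of the features each tree may use.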


What Is Text Complexity? 


On a broad stage, in science writ large, “complexity” has over- 
taken “parsimony” as a focal interest in both physical and social 
sciences. Scientists increasingly aim to understand complexity as it 
exists naturally in the world—as opposed to more traditional 
efforts to reduce natural occurrences to some fundamental sim- 
plicity (e.g., Bar-Yam, 1997). The seminal philosophical definition 
of complexity may be attributed to Rescher (1998, p. 1): “Complexity 
is . . . a matter of the number and variety of an item’s 
constituent elements and of the elaborateness of their interrela- 
tional structure, be it organizational or operational.” Complexity 
theory suggests that although the complexity of some objects, 
events, or actions may not be fully understood, three essential 
elements of complex systems can be pinpointed and characterized 
(Bar-Yam, 1997; Kauffman, 1995). First, in general, complex 
systems involve a large number of mutually interacting parts, but 
even a small number of interacting components can behave in 
complex ways (Albert & Barabasi, 2002; Bar-Yam, 1997). When 
complexity occurs, a reciprocal relationship exists between parts 
and wholes. Ensembles are influenced by the distinct elements, but 
the distinct elements are also influenced by the whole of the 
ensemble (Merlini Barbaresi, 2003). Second, however, there is 
usually a limit to the number of parts the researcher has primary 
interest in, and paradoxically, for practical and research purposes, 
often summative description of a complicated system may require 
description as a particular few-part system where the few-part 
system retains the character of the whole (Bar-Yam, 1997). Third, 
most complex systems are purposive, and there is often a sense in 
which the systems are engineered (Bar-Yam, 1997). 

Following suit, for the present study, a dynamic systems defi- 
nition of text complexity was embraced. First, “text” is defined as 
“. . . an organized unit, whose various components or levels are 
recognized to give autonomous contributions to the global effect 
...” (Merlini Barbaresi, 2002, p. 120). Second, text complexity is 
“. . . a dynamic configuration resulting from the contributions of 
complex phenomena, as they occur at the various text levels” and 
across text levels (Merlini Barbaresi, 2003, p. 23). The CCSS 
text-complexity definition further undergirded the present work— 
text complexity is “the inherent difficulty of reading and compre- 
hending text combined with consideration of the reader and task 
variables” (NGA & CCSSO, 2010, Appendix A, Glossary of Key 
Terms, p. 43). The Common Core definition is embedded in a 
systems outlook in which complexity arises among reader, printed 
text, and situation during the whole of a reading act. That is, when 
engaged in a specific reading encounter, complexity is in some 
degree relative to an individual and to contextual characteristics 
(such as age or developmental reading level or degree of teacher 
support while reading). Concomitantly, complexity of particular 
texts is relative to populations of readers at different ages or 
reading ability levels (cf. Kusters, 2008, and Miestamo, 2009, on 
relative vs. absolute complexity; van der Sluis & van den Broek, 
2010). That is, when viewed on a continuum of complexity in 
relation to many readers’ developmental levels, texts have an 
emergent nature and can be assigned a “complexity level” to 
situate them on an entire continuum. The stance is consistent with 
theories of reading dating back to Rosenblatt’s expositions on 
reading as transactional (Rosenblatt, 1938, 2005) and Rumelhart’s 
(1985) explanation of reading as interactive and, more recently, to 
the widely accepted Rand Reading Study Group model of reading 
(Snow, 2002). For example, in the Rand Reading Study Group 
model, text is squarely rooted in an interaction with the reader as 
reading happens during an activity within a particular social con- 
text. The stance is also consistent with Mesmer et al.’s (2012) 
exposition of early-grades text characteristics in that they also 
address text complexity as situated within individual and social/ 
instructional contexts. 

Commensurate with the three essential elements named above 
for complex systems, for the present study, we assumed (a) that 
early-grades texts are complex systems consisting of many mutu- 
ally interacting characteristics and ensembles of characteristics 
that interplay to impact text complexity, and the characteristics can 
be quantitatively measured; (b) to begin to understand the text- 
characteristic functioning, we would need to consider an organi- 
zational scheme for the characteristics and explore whether and 
how characteristics interact; and (c) the complexity of early-grades 
texts purposefully exists (i.e., it is in some sense engineered) to 
support young children to learn to read with as much ease as 
possible. As well, exploration of interplay among text character- 
istics would be essential to successful explanation of text com- 
plexity. 


Which Text Characteristics Might Matter Most for 
Early-Grades Text Complexity? 


An “optimal” text is one in which text characteristics are con- 
figured such that readers can construct meaning while engaged 
with the text with the greatest amount of ease and the greatest 
depth of processing (cf. Merlini Barbaresi, 2003, on optimality 
theory and Juola, 2003, on the necessity of complex systems to 
reflect “process,” including cognitive process). Text authors may 
consciously or unconsciously use optimality when creating texts 
for particular audiences. Generally, authors must make trade-off 
choices between favoring readers’ processing ease (efficiency) and 
readers’ processing depth (effectiveness), and the point of balance 
between the two is constrained by intended uses of the text, 
including intended readers of the text (cf. Merlini Barbaresi, 2003, 
who references the trade-offs, but in recognition of how an author 
develops a text, rather than in reference to readers/audience). For 
example, in content-laden disciplinary texts, readers’ processing 
depth (effectiveness) is often given preference over readers’ pro- 
cessing ease (efficiency). Early-grades texts are generally created 
to heighten certain factors related to children’s processing ease 
(such as word decodability), while simultaneously requiring a 
relatively low level of processing depth, that is, requiring little 
effort for meaning creation. Further, some evidence suggests that 
text characteristics do influence the early word-reading strategies 
that young children develop (Compton, Appleton, & Hosp, 2004; 
Juel & Roper-Schneider, 1985). For example, in one study, when 
tested on novel words, young students who read highly decodable 
texts outperformed other students who primarily read texts with 
repetition of high-frequency words (Juel & Roper-Schneider, 
1985). 

The concept of optimality suggests that different text character- 
istics might be more important at certain levels of students’ read- 
ing development than at others, leading directly to consideration of 
which characteristics of text might be related to the development 
of students’ emergent reading ability. A deep research base sug- 
gests that, while meaning creation is at the heart of learning to 
read, “cracking the code” requires focal effort for beginning read- 
ers, and critical cognitive factors inherent in the early learning-to- 
read phase are development of phonological awareness and word 
recognition (e.g., Adams, 1990; Fitzgerald & Shanahan, 2000). As 
a result, hypothetical critical text characteristics that would support 
early word-reading development are, for example, texts that are 
composed of: repetition of simple words, which likely facilitates 
sight word development and orthographic-pattern knowledge (e.g., 
Metsala, 1999; Vadasy, Sanders, & Peyton, 2005); words with 
relatively simple orthographic configurations, which facilitates 
orthographic-pattern knowledge (e.g., Bowers & Wolf, 1993); 
rhyming words, which may advance phonological awareness (e.g., 
Adams, 1990); words that are familiar in meaning in oral language, 
which likely reduce challenges to meaning creation while reading, 
permitting more attention to word recognition (e.g., Muter, Hulme, 
Snowling, & Stevenson, 2004); and repeated refrains or repetitive 
phrases, which likely reinforce phonological awareness and devel- 
opment of sight words along with varied word recognition strate- 
gies such as using context to make guesses at unknown words 
(e.g., Ehri & McCormick, 1998; cf. Bazzanella, 2011, on multiple 
functions of repetition in oral discourse, including cognitive facil- 
itation). Moreover, inclusion of several types of text-characteristic 
support might exponentially boost students’ ease of learning about 
code-related facets of reading. 

Consequently, to describe early-grades text complexity, it is 
theoretically necessary to consider several text characteristics at 
multiple linguistic levels (Graesser & McNamara, 2011; Graesser, 
McNamara, & Kulikowich, 2011; Kintsch, 1998; Snow, 2002). 
Studying linguistic levels in text complexity is compatible with 
research that suggests that hierarchy is one of the central architec- 
tures of complexity (Simon, 1962). The research base supporting 
the importance of multiple levels of text characteristics for early 
phases of learning to read is extensive and comprehensive (Mes- 
mer et al., 2012). Only illustrative citations are provided in the 
following summary (which compares to Mesmer et al., 2012). 

Beginning readers learn to attach specific sounds to graphemes 
and vice versa (e.g., Fitzgerald & Shanahan, 2000), and the re- 
search base on the importance of phonological activity is extensive 
(e.g., Schatschneider, Fletcher, Francis, Carlson, & Foorman, 
2004). Other aspects of word-level features have also received 
wide attention in early-grades texts. In particular, word structure 
(how a word is configured) and word frequency (the degree to 
which a word occurs in spoken or written language) have deep 
research bases. With regard to word structure, letter-sound regularity 
in words is highlighted in decodable and linguistic texts 
where significant attention is paid to word rimes and bigrams and 
trigrams (two and three letter units). Such texts have been shown 
to have positive impact on oral reading accuracy, but not on 
comprehension or other global measures of reading (e.g., Compton 
et al., 2004). With regard to word familiarity, many early-grades 
texts are designed to include repetition of high-frequency words. 
Children’s accuracy and speed of recognition are influenced by 
word frequency (e.g., Howes & Solomon, 1951). 
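
As a concrete illustration of the word-level units just described, the following minimal sketch extracts letter bigrams and trigrams from a word and counts how often each word is repeated in a short passage. The function names and the sample passage are hypothetical, and the counts describe repetition within the passage itself, not frequency in a reference corpus of spoken or written language.

```python
import re
from collections import Counter


def letter_ngrams(word, n):
    """Return the contiguous n-letter units of a word, in order."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_counts(text):
    """Count how often each word form occurs in a text (case-insensitive)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Invented passage in the style of a decodable early-grades text.
sample = "The cat sat. The cat ran. The fat cat sat."

bigrams = letter_ngrams("cat", 2)      # two-letter units
trigrams = letter_ngrams("splash", 3)  # three-letter units
counts = word_counts(sample)

print(bigrams)
print(trigrams)
print(counts.most_common(3))
```

In this passage the rime "at" recurs across "cat," "sat," and "fat," which is the kind of letter-sound regularity that decodable texts are designed to highlight.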

The importance of knowing key meanings in texts has been well 
substantiated in relation to its impact on comprehension (e.g., 
Stanovich, 1986), and some evidence suggests that young students 
may benefit from texts with easier and more familiar vocabulary 
(e.g., Hiebert & Fisher, 2007). However, current-day early-grades 
texts may contain a fairly large amount of challenging word 
meanings (e.g., Foorman, Francis, Davidson, Harm, & Griffin, 
2004). In general, words that occur with higher frequency are 
processed more quickly and tend to be associated with networks of 
knowledge (Graesser et al., 2011). In addition to word frequency, 
other word meaning factors, including imageability, concreteness, 
and age of word acquisition, have been shown to be significant for 
students’ comprehension and/or word recognition during reading 
(e.g., Woolams, 2005). 

Within-sentence syntax is primarily related to the ease or chal- 
lenge for creating meaning while reading, as opposed to word 
recognition (Mesmer et al., 2012). The importance of within- 
sentence syntax in texts is likely due to the extent to which 
complexity within a sentence places demands on children’s work- 
ing memory (Graesser et al., 2011). 

Discourse-level text characteristics impact aspects of reading in 
general (Graesser et al., 2011) and are likely to be related to early 
reading. For example, referential cohesion—occasions when a 
noun, pronoun, or noun phrase references another element in the 
text—has been shown to be related to reading time and compre- 
hension (e.g., McNamara & Kintsch, 1996). More cohesive texts 
tend to facilitate comprehension, likely because they support men- 
tal model building (Kintsch, 1998). It has long been known that 
even young readers have expectations for story structures that they 
tend to use to guide comprehension, although young students tend 
to reveal such expectations to a lesser extent than do older students 
(e.g., Mandler & Johnson, 1977; Whaley, 1981). As well, better 
readers make use of informational text structures for comprehen- 
sion and recall (Britton, Glynn, Meyer, & Penland, 1982). A final 
potential discourse-level text characteristic is genre, generally con- 
sidered by linguists and discourse analysts to be a slippery con- 
struct (Rudrum, 2005; Steen, 1999). However, questions remain 
about the relationship between genres and text complexity, espe- 
cially with regard to identification of various genres according to 
specific text features (e.g., Mesmer et al., 2012). For instance, 
findings on the view that narratives are easier texts than other 
genres are mixed (e.g., Langer et al., 1995, supported the view, 
while Duke, 2000, did not). 

In addition to considering which sorts of text characteristics 
might be especially important for examining early-grades text 
complexity, it is essential to embrace potential interplay among 
various text characteristics. Theoretically, the emergent nature of 
text complexity is in part due to the challenge level of the constit- 
uent elements, but it may also develop through the interplay of the 
elements (Merlini Barbaresi, 2003). Complex systems tend to have 
subsystems that may conflict depending on their “targets,” and to 
attain a successful result, subsystems need to co-operate toward a 
compromise solution (Merlini Barbaresi, 2003; cf. Gamson, Lu, & 
Eckert, 2013, on text characteristic “trade-offs”; Gervasi & Amb- 
riola, 2003). That is, text characteristics at different linguistic 
levels may have conflicting impact on readers (their “targets”). For 
instance, an author may choose to write a text for second-grade 
students about a content-area topic, such as sound waves, requiring 
heavily laden vocabulary meanings that may make the text quite 
complex for young readers. But the words may also be technically 
challenging for word recognition. As an ensemble, difficult vo- 
cabulary meanings coupled with high decoding demand can mag- 
nify complexity exponentially. The author might consider ways of 
lessening the burden on the reader by employing other text-level 
characteristics, such as using a within-sentence syntactic pattern 
that is generally familiar to typically developing second-grade 
students or inserting parenthetical definitions after difficult word 
meanings, or at the discourse level, placing main ideas first in 
paragraphs. As another example, there is evidence that
concreteness/abstractness, or imageability, interacts with structural
complexity and word familiarity to influence readers’ word recognition
(e.g., Schwanenflugel & Akin, 1994). In short, constellations of 
co-occurring linguistic characteristics may contribute to variation 
in text complexity (Biber, 1988). 


Measuring Text Complexity Quantitatively 


Several established computerized systems address text complexity 
beyond the early grades through quantitative measurement. They 
are summarized here to provide context for the present study: 
readability formulae that are typically focused on word frequency, 
word length, and/or sentence length (e.g., Renaissance Learning, 
2014; Klare, 1974; The REAP project, n.d.); conjoint measurement 
systems that relate students’ reading levels to text-complexity 
levels on the same scale, identifying collections of text character- 
istics (typically a small set such as word frequency and within- 
sentence syntax) that serve as “best predictors” of text complexity 
levels (e.g., the Lexile Framework for Reading [Stenner, Burdick, 
Sanford, & Burdick, 2006] and Degrees of Reading Power [DRP; 
Koslin, Zeno, & Koslin, 1987]); and natural language processing 
analyses involving multiple text characteristics (e.g., Coh-Metrix 
[Graesser et al., 2011; McNamara, Graesser, McCarthy, & Cai, 
2014]; Reading Maturity Metric [Pearson Education, 2014]; and 
SourceRater [Sheehan, Kostin, Futagi, & Flor, 2010]). The systems
may be differentiated in the following ways: (a) All measures
except Coh-Metrix provide a single quantitative
judgment of texts’ complexity levels. Some do so using grade
levels; others use their own leveling system. (b) Only Lexile and
DRP measures are relational to readers, that is, they are originally 
based on individuals’ reading of the texts—except that the 
SourceRater measure uses an “inheritance principle” in which the 
original outcome variable used in the predictor equation was 
educators’/publishers’ assignment of text grade levels. Other measures
examine text characteristics and then use a form of dimension
reduction, such as principal components analysis, to determine
essential components of text complexity. (c) Coh-Metrix and 
SourceRater quantify the broadest number of text characteristics 
and include discourse-level text characteristics in their analyses. 


Across the various systems, the most common text characteris- 
tics that are best predictors of text complexity are word familiarity, 
word length, sentence syntax, and/or sentence length. The 
SourceRater system involves eight dimensions—syntactic com- 
plexity, vocabulary difficulty, level of abstractness, referential 
cohesion, connective cohesion, degree of academic orientation, 
degree of narrative orientation, and paragraph structure. Coh- 
Metrix employs 53 text-characteristic measures reduced to five 
dimensions—narrativity, syntactic simplicity, word concreteness, 
referential cohesion, and deep cohesion. Importantly, none of the 
currently existing common metrics specifically provides explana- 
tion of what constitutes early-grades text complexity (cf. Graesser 
et al., 2011, and van der Sluis & van den Broek, 2010). 


Summary 


As the Common Core text-complexity standard is implemented 
in schools, educators and researchers alike need an empirically 
based understanding of text complexity for early-grades readers. 
Complexity theory provides a foundation for studying early-grades 
text complexity. Key principles of complex systems are that they 
involve a large number of mutually interacting parts; interplay 
among components can be locally, rather than globally, relevant; 
they often may be described by hierarchical organization; and they 
are purposive, that is, engineered for particular purposes. A relational
outlook on text complexity implies that the complexity of particular
texts is relative to particular individuals, reading occasions, and 
developmental reading levels. However, theoretically, texts have 
an emergent “developmental” complexity such that they can be 
assigned a complexity level in relation to an entire continuum of 
complexity. Using an “optimality” concept in conjunction with 
what is known about critical cognitive factors for the early 
learning-to-read phase and prior findings about the importance of 
selected text characteristics during early reading, not only should 
many text characteristics at multiple linguistic levels be investigated,
but interplay among text characteristics should be hypothesized.
Few of the prior text-complexity measurement systems
encompass discourse-level characteristics, few address text complexity
as relational, either within a specific reading occasion or in
the sense of student reading-ability development, none addresses
the interplay or potential interactive nature of text characteristics, 
and importantly, none specifically addresses early-grades text 
complexity. In the present study, a relational frame is used to 
explore text characteristics that matter most for early-grades texts, 
and the potential interplay of text characteristics is naturally ac- 
counted for through use of a statistical modeling technique that is 
prevalent in many fields but novel to educational research, that is, 
random forest regression. 


Method 


Overview 


Three hundred fifty primary-grades texts were selected and 
digitized. Twenty-two text characteristics were identified at four
linguistic levels. Multiple computerized variable operationaliza- 
tions were created for each of the 22 text characteristics, totaling 
238 variables. The variables were automated so that a computer 
could examine the digitized texts and produce text-complexity
measures for each operationalization. Analyses were conducted
using a logical analytical progression typically used in machine- 
learning research (Mohri, Rostamizadeh, & Talwalkar, 2012). 
The three phases of analysis were: variable selection to find a subset
of the most important text characteristics out of the 238 operation- 
alizations; using 80% of the texts, “training” a random forest 
regression model (Breiman, 2001a) of the most important text 
characteristics associated with text-complexity level; and validat- 
ing the model on a 20% “hold-out” set of texts. Follow-up analyses 
were done to explore the data structure. 
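
The 80%/20% partition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the text identifiers, the fixed seed, and the function name `split_train_holdout` are hypothetical, and the sketch covers only the partition step, not variable selection or the random forest model itself.

```python
import random


def split_train_holdout(text_ids, holdout_fraction=0.20, seed=0):
    """Shuffle text identifiers and split them into training and hold-out sets."""
    ids = list(text_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    holdout = ids[:n_holdout]
    train = ids[n_holdout:]
    return train, holdout


# Hypothetical identifiers standing in for the 350 digitized texts.
texts = [f"text_{i:03d}" for i in range(350)]
train_ids, holdout_ids = split_train_holdout(texts)
print(len(train_ids), len(holdout_ids))
```

Keeping the hold-out texts entirely out of variable selection and model training is what allows the validation phase to estimate performance on unseen texts.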


Texts 


Three hundred fifty texts (148,068 words in total) intended for 
kindergarten through second grade constituted the text base. An 
existing larger corpus of early-grades texts was made available for 
the study (MetaMetrics, n.d.-a), and maximum-variation purposive 
selection (Patton, 1990) was used to choose texts from the corpus. 
As well, 18 kindergarten through second-grade Common Core 
State Standards (NGA & CCSSO, 2010, Appendix B) exemplar 
texts (that were not present in the available corpus) were pur- 
chased. The goal of maximum-variation purposive selection was to 
ensure comprehensive representation of a wide variety of early- 
grades text types, text levels, and publishers that currently exist in 
U.S. early-grades classrooms. We chose 350 texts for two main 
reasons: (a) to include a sufficiently large number of texts that 
would adequately represent the domain and to ensure sound sta- 
tistical analyses (following the suggested sample size in 
Heldsinger & Humphry, 2010) and (b) to include a manageable set 
of texts to accomplish teacher and student tasks needed for devel- 
opment of the text-complexity-level variable (described below in 
the section, “Text-Complexity Level”). All texts were reproduced 
in authentic form (including pictures) and digitized. 

Six categories for commonly occurring early-grades text types 
for independent reading were determined: code-based (decodable, 
phonics), whole-word (texts that include many words that appear 
in early-grades texts with high frequency), trade books (books 
commonly sold for library, supplemental materials for classroom 
use, or private sale), leveled books (texts that are sequenced in 
difficulty level), texts of assessments, and other (e.g., label books). 
The first four text types had been previously identified in studies 
of classroom texts as reasonably comprehensive categories of 
early-grades texts intended for independent reading in primary- 
grade classrooms (Aukerman, 1984; Hiebert, 2011). The last two 
categories were included because texts appearing in assessments 
also commonly occur in early-grades classrooms, and texts of 
assessments may become even more prominent with the advent of 
the Common Core State Standards (NGA & CCSSO, 2010). Some 
commonly occurring early-grades texts, such as label books, do 
not fit well into the previous categories. The first four category 
labels are common terms used by educators and publishers (Mes- 
mer, 2006). 

It was not possible to consider proportional representation of 
types as they exist in United States classrooms because, to our 
knowledge, there is no direct evidence of the degree to which 
different categories of early-grades texts are present or used in 
U.S. classrooms, although at least one survey of U.S. primary- 
grades teachers suggested that use of the first four categories of 
texts is widespread (Mesmer, 2006). Consequently, we selected
“prototypes” to represent each category (Hiebert & Pearson,
2010), and, where series existed, texts were sampled from the 
range in the series. In reality, many early-grades texts fall into two 
or more of the category types (Mesmer, 2006). For example,
code-based texts are often “leveled.” However, for our purposes of 
ensuring wide representation of text types, each text was assigned 
to a single category. If a text was labeled “decodable” or “phonics” 
by the publisher, it was labeled “code-based.” If a publisher 
characterized a text as primarily attending to high-frequency words 
or sight words, it was labeled “whole word.” A text was labeled 
“trade book” if it was available in the trade market and not just in 
the school market, and it was not identified by the publisher as 
decodable, phonics, or high-frequency. A text was labeled “lev- 
eled” if the text was assigned a level (other than grade level) by the 
publisher and was not labeled “decodable,” “phonics,” or “high 
frequency.” 

Text levels were determined by using publisher-designated 
grade, level, or age ranges. Texts were labeled: easy if they were 
designated kindergarten, kindergarten levels (as noted on publisher 
websites), or typical ages for kindergarten; moderately hard if 
designated first grade, first-grade levels, or first-grade ages; and 
hard if designated second grade, second-grade levels, or second- 
grade ages. 

Thirty-two publishers were represented in the 350 texts, ranging 
from three to 15 different publishers for each of five of the six text 
types, with one publisher for the text-of-assessment type. 

Text genre (narrative, informational, hybrid) was determined 
using a modification of Duke’s (2000) procedures. Two primary 
text characteristics were used to discern narrative, informational, 
and hybrid text—purpose and textual attributes. Narrative text was 
defined as follows (Duke, 2000; Rudrum, 2005): It is a series or 
sequence of events, with the intention or purpose to evoke an 
element of reader response. It tells a “story” and/or has characters, 
places, events, and things that are familiar and is closely related to
oral conversation. Informational text was defined as text that conveys
information about the natural or social world and is typically 
written by someone who is presumed to know the information to 
someone who is presumed to not know it (Duke, 2000). Textual 
attributes for narratives included, for instance, events, actions with
temporal or causal links, characters, and dialogue. Textual attributes
for informational texts included, for example, facts, timeless verb
constructions, technical vocabulary, descriptions of attributes, and
definitions. A set of rules modified from Duke (2000) was devised for
determining genre classification, using a decision tree process that 
began by determining the purpose of the book and then addressing 
attributes of the text. Interclassifier reliability between two indi- 
viduals for 20% of the 350 books was .96. 

Finally, the text corpus can be described as follows. Caution
should be exercised when interpreting the following figures for the
text categories because, again, the categories are not mutually
exclusive. Using the publisher designation in concert with
the researcher-devised system described above for texts that
could belong to two or more categories, 41% of the texts were
leveled, 17% were code-based, 15% were trade books, 10% were
whole-word, 9% were texts of assessments, and 8% were other.
Approximately 36% of the 350 texts were labeled easy, 37% moderately
hard, and 27% hard. Sixty-six percent were labeled narrative,
24% informational, and 10% hybrid or other.


Variables


Text-complexity level. The outcome variable was early- 
reader text-complexity level measured using a continuous, devel- 
opmental scale, with scores ranging from 0 to 100. An overview of 
the scale-building procedures is as follows. (Further details of the 
procedures are provided in the online supplemental materials.) 
Because text complexity was defined at the intersection of printed 
texts with students reading them for particular purposes and doing 
particular tasks, a multiple-perspective measure of text complexity 
was created using student responses during a reading task and 
teachers’ ordering of texts according to complexity. In doing so, 
we represented students and teachers as readers, and teachers as 
important context for student reading instruction, as well as two 
different tasks in the final measure. Then the magnitude and 
strength of the association between the two logit scales (one from 
student responses and one from teachers’ text ordering) was ex- 
amined, and to arrive at a single scale, a linear equating linking 
procedure (Kolen & Brennan, 2004) was used to bring the student 
results onto a common scale with the teacher results. Finally, for 
ease of interpretability, the logit scale was linearly transformed to 
a 0 to 100 scale. 

In a first substudy, through Rasch modeling (Bond & Fox, 2007) 
a text-complexity logit scale was created from the interface of 
1,258 children from 10 U.S. states reading passages from a subset 
of the 350 texts and responding to a maze task (see Shin, Deno, & 
Espin, 2000, for task validity). Cronbach’s alpha estimates of 
reliability for all test forms ranged from .85 to .96. Also, dimen- 
sionality assessments for text genre and for differential text order- 
ing according to student ethnicity, gender, or free-reduced-lunch 
status suggested no evidence of measurement multidimensionality. 
After creation of the logit scale, each text in the subset was 
assigned a text-complexity level. 

In a second substudy, also through Rasch modeling, a second 
text-complexity logit scale was created from 90 practicing 
primary-grades teachers’ (from 33 states and 75 school districts) 
evaluations of texts’ complexity. Teachers ordered random pairs of 
the 350 texts seen side by side on a computer screen. For each pair, 
teachers clicked on the text they thought was more complex. 
Determined by the separation index method (Wright & Stone, 
1999), measurement reliability was .99. After creation of the logit 
scale, each of the 350 texts was assigned a text-complexity level. 
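
One way to see how paired "which is more complex?" judgments can yield a logit scale is the Bradley-Terry model, which has the same logistic form as a Rasch model for paired comparisons. The sketch below is an illustration only: the estimation routine, the function name `pairwise_logit_scale`, and the toy judgments are assumptions, and the software and estimation method actually used in the study are not described in this section.

```python
import math
from collections import defaultdict


def pairwise_logit_scale(judgments, iterations=200):
    """Estimate logits from (more_complex, less_complex) judgment pairs.

    Model: P(i judged more complex than j) = s_i / (s_i + s_j).
    Strengths are fitted with the classic iterative update; every text
    needs at least one "win" and one "loss" for finite estimates, and the
    two texts in a pair must differ.
    """
    wins = defaultdict(int)         # times each text was judged more complex
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in judgments:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Fix the origin of the scale: geometric mean of strengths is 1,
        # so the resulting logits sum to zero.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {i: v / math.exp(log_mean) for i, v in updated.items()}
    return {i: math.log(v) for i, v in strength.items()}


# Hypothetical judgments for three texts, each pair compared ten times.
judgments = (
    [("C", "B")] * 8 + [("B", "C")] * 2
    + [("B", "A")] * 8 + [("A", "B")] * 2
    + [("C", "A")] * 9 + [("A", "C")] * 1
)
scale = pairwise_logit_scale(judgments)
print(sorted(scale, key=scale.get))
```

In this toy data, text C is usually chosen over B, and B over A, so the estimated logits place A lowest and C highest.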

Next, the correlation between the two logit scales (N = 89 texts)
was .79 (p < .01), suggesting that the texts were ordered on text
complexity similarly whether teachers or students were involved.
The relatively high correlation was also evidence of concurrent 
validity in that it suggested that the two logit scales were measur- 
ing the same construct. Consequently, a linking equating proce- 
dure was used to link the two logit scales (Kolen & Brennan, 
2004). Finally, a linear transformation was done resulting in mea- 
sures that could range from 0 to 100 on a text-complexity scale. 
That is, the 350 texts ordered by teachers could be assigned a 
measure from 0 to 100, and the texts read by students could be 
assigned a measure from 0 to 100. 
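
The two steps above, linking one logit scale to the other and then rescaling to 0 to 100, can be sketched as follows. This is a minimal sketch assuming mean-sigma linear equating on the shared texts and a rescaling anchored at chosen low and high logit values; the function names, the toy logits, and the anchor values are hypothetical, and the constants actually used in the study are not given in this section.

```python
from statistics import mean, pstdev


def linear_equate(student_logits, teacher_logits):
    """Return a function mapping student-scale logits onto the teacher scale.

    Matches the mean and standard deviation of the two scales on the texts
    they share: y = mean_T + (sd_T / sd_S) * (x - mean_S).
    """
    mean_s, sd_s = mean(student_logits), pstdev(student_logits)
    mean_t, sd_t = mean(teacher_logits), pstdev(teacher_logits)
    slope = sd_t / sd_s
    return lambda x: mean_t + slope * (x - mean_s)


def rescale_0_100(logit, low, high):
    """Linearly map a logit in [low, high] onto a 0 to 100 scale."""
    return 100.0 * (logit - low) / (high - low)


# Hypothetical logits for five texts measured on both scales.
student = [-2.0, -1.0, 0.0, 1.0, 2.0]
teacher = [-3.0, -1.5, 0.0, 1.5, 3.0]
to_teacher_scale = linear_equate(student, teacher)
linked = [to_teacher_scale(x) for x in student]
scores = [rescale_0_100(y, low=-3.0, high=3.0) for y in linked]
print(scores)
```

Because both steps are linear, the ordering of texts and the relative spacing between them are preserved; only the origin and unit of the scale change.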

Text characteristics and their variable operationalizations. 
Twenty-two text characteristics were identified at four linguistic 
levels—sounds in words, words, within-sentence syntax, and 
across-sentences or discourse level. Discourse-level characteristics 
captured repetition, redundancy, and patterning (of letters, words,
phrases, and/or sentences) that occurred in the texts. In an effort to
capture a wide variety of ways of representing the text character- 
istics, multiple computerized variable operationalizations were 
created for many of the 22 text characteristics, totaling 238 vari- 
able operationalizations. The rationale for including as many vari- 
able operationalizations as possible was that different metrics may 
pinpoint different aspects of a text characteristic (Baca-Garcia et 
al., 2007). By including as many operationalizations as possible, 
the chances of capturing critical text characteristics for text com- 
plexity were increased. 

Table 1 shows the 22 text characteristics according to linguistic 
level, along with definitions, the number of variable operational- 
izations for each, and selected examples of operationalizations and 
their possible score ranges and interpretations. A complete list and 
description of operationalizations is available in the online supple- 
mental materials. 

Operationalizations were accomplished using four logical 
approaches. First, several types of computational metrics were 
considered. In addition to traditional metrics, such as counts, 
mean, and percentage, six specialized computational linguistic 
techniques were used to produce other metrics. One specialized 
computational linguistic technique was distributional semantics 
(Landauer & Dumais, 1997), a method for quantifying semantic 
similarities between linguistic items. Three additional specialized 
computational linguistics techniques were: part-of-speech tagging 
(Collins, 2002); syntactic parsing (Sleator & Temperley, 1991); 
and a Levenshtein (1965) metric, which gauges the minimum 
number of substitutions, insertions, or deletions required to turn 
one linguistic unit (e.g., a written word) into another. Also, two 
unique metrics that specifically capture text characteristics in relation
to student readers were applied to all of the sounds-in-words
variables and most of the word-level variables: types-as-test (types being
the unique words in a text) and words-as-test (words being all words in a text).
Both metrics treat the text characteristic of interest as test items, 
while considering a potential student who might be reading the text 
to have a trait level for the characteristic of interest. Both represent 
an alternative way to measure central tendency for a distribution of 
values, and both are more impacted by outliers than an average. 
For instance, for a types-as-test operationalization for syllables 
(the text characteristic of interest) in a text, the unique words in the 
text are listed, and the number of syllables is counted in each word. 
Then one might hypothesize that a student has a “syllable-level 
reading ability” for reading the text. The unique words (types) 
form a test for measuring a student’s ability to use syllables to read 
the text. Each unique word is given an item difficulty level that is 
the number of syllables in the word. A target level of hypothetical 
student performance is set (50%, 75%, 100% of the items pre- 
dicted to be correct), and then using Rasch modeling (Bond & Fox, 
2007) the metric determines what level of reader ability would be 
expected to attain the percentage that was set. The overall metric 
(derived from a mathematical formula) therefore summarizes a 
“syllable” level of complexity for the text. 
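
The types-as-test idea for syllables can be illustrated with a short sketch. It assumes the simplest reading of the description above: each unique word is an item whose Rasch difficulty is its raw syllable count, and the metric is the ability at which the expected proportion of items answered correctly equals the target. The syllable lookup, the function names, and the use of bisection are assumptions made for illustration; the study's exact formula is given in its supplemental materials. Under this simple form a target of exactly 100% has no finite solution, so the sketch restricts targets to values strictly between 0 and 1.

```python
import math


def expected_proportion_correct(ability, difficulties):
    """Mean Rasch success probability across all items at a given ability."""
    total = sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in difficulties)
    return total / len(difficulties)


def ability_for_target(difficulties, target, lo=-20.0, hi=40.0, tol=1e-9):
    """Find, by bisection, the ability whose expected proportion correct equals target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Expected proportion correct rises with ability, so move the
        # lower bound up whenever the midpoint falls short of the target.
        if expected_proportion_correct(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical syllable counts; the study drew these from the MRC database.
syllables = {"the": 1, "cat": 1, "sat": 1, "on": 1, "yellow": 2, "umbrella": 3}
text = "the cat sat on the yellow umbrella"
types = sorted(set(text.split()))               # unique words form the "test"
difficulties = [syllables[w] for w in types]    # item difficulty = syllable count
theta_50 = ability_for_target(difficulties, 0.50)
theta_75 = ability_for_target(difficulties, 0.75)
print(theta_50, theta_75)
```

A words-as-test variant would use every word token rather than the unique words, so frequently repeated words would weigh more heavily in the result.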

A second logical approach was that discourse text characteris- 
tics were systematically treated as follows. The main focus of 
discourse-level variables was to capture linkages among words and 
meanings in text (e.g., cohesion), redundancy, and patterning that 
occur across a whole text or parts of text but more than just within 
sentences. For each discourse text characteristic, first, variable 
operationalizations were considered that would reflect a lexical 
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Table 1 
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Text Characteristics by Linguistic Level, Definition, Possible Score Range for Examples of Operationalizations, and Number of 


Operationalizations With Examples 





Linguistic level Text characteristic 


Definition (source) 


Possible score range for 


examples 


Operationalization 
example (number of 
operationalizations) 





Sounds in words Number of phonemes in 


words 


Phonemic Levenshtein 
distance 


Mean internal phonemic 
predictability 


Word structure Decoding demand 


Orthographic Levenshtein 
distance 


Number of syllables in words 


Mean internal orthographic 
predictability 


Sight words 


Smallest unit of sound. (The 


MRC Psycholinguistic 
Database provides phoneme 
values for words; Coltheart, 
1981.) 


The degree to which co-occurring 


phonemes exist across words. 
(Levenshtein Distance is a 
standard computer metric of 
string edit distance which 
gauges the minimum number 
of substitution, insertion, or 
deletion operations to turn one 
word into another. Measures 
phonemic similarity across 
words for the 20 closest words; 
Levenshtein, 1965; Yarkoni, 
Balota, & Yap, 2008; cf. 
Kruskal, 1999; Nerbonne & 
Heeringa, 2001; Sanders & 
Chinn, 2009.) 


The degree to which phoneme 


collocations occur given the 
totality of the phoneme 
collocations in the particular 
text. (Words are converted to 
phonemes using the Carnegie 
Mellon University, n.d., 
Pronouncing Dictionary.) 


The decoding demand of words 


in the text (slight modification 
of Menon & Hiebert’s, 1999, 
decodability scale). 


See phonemic Levenshtein 


distance above. Orthographic 
Levenshtein distance measures 
orthographic similarity across 
words for the 20 closest words 
(Levenshtein, 1965; cf., 
Kruskal, 1999; Yarkoni, et al., 
2008). 


Number of syllables in words. 


(The MRC Psycholinguistic 
Database provides syllable 
values for words; Coltheart, 
1981.) 


The degree to which letter 


collocations occur given the 
totality of the letter 
collocations in the particular 
text (researcher computer 
coded; cf. Solso, Barbuto, & 
Juel, 1979). 


The most commonly occurring 


words in primary-grades texts 
(Dolch word list, n.d.; Fry 
Word List, 2012). 


1 (fewer phonemes in 


words, less complex) to 
less than 10 (more 
phonemes in words, 
more complex) 


1 (few words in closest 20 


share phonemes) to 3 
(more words in closest 
20 share phonemes) 


0 (fewer phoneme 


collocations are 
repeated in the text) to 
1 (more phoneme 
collocations are 
repeated in the text) 


1 (ess complex word 


structure) to 9 (most 
complex word structure) 


1 (fewer words in 20 


share orthographic 
patterns) to 3 (more 
orthographic patterns 
shared) 


1 (few words with many 


syllables) to 8 (more 
words with more 
syllables) 


0 (fewer orthographic 


trigrams are repeated in 
the text) to 1 (more are 
repeated in the text) 


0 (less complex) to 100 


(more complex) 


Mean number of phonemes 
for words in the text 
(14) 


Mean phonemic 
Levenshtein distance 20 
with stop list 50 most 
frequent words (14) 


Mean with text chunk size 
125 (4) 


Mean with stop list 50 
most frequent words 
(22) 


Mean orthographic 
Levenshtein distance 20 
with stop list 50 most 
frequent words (14) 


Types as test with stop list 
50 most frequent (ability 
at 75%) (18) 


Product of internal word 
values with chunk size 
125 (4) 


Percentage of words in a 
text that are on the 
Dolch Preprimer list = 
(13) 
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Linguistic level Text characteristic Definition (source) 


Word meaning Age of acquisition Age at which a word’s meaning 
is first known (Kuperman, 
Stadthagen-Gonzalez, & 


Brysbaert, 2012). 


Abstractness Degree to which the text contains 
words that reference general or 
complex concepts such as 
“honesty” and cannot be seen 
or imaged (Paivio, Yuille, & 
Madigan, 1968; updated by 
Coltheart, 1981). 

The inverse of the frequency with 
which a word appears in 
running text in a corpus of 
1.39 billion words from 93,000 
kindergarten through university 
texts normalized to equate to 
the frequencies in the Carroll, 
Davies, & Richman (1971) 
frequency 5 million word list 
(MetaMetrics, n.d.-b). 

Number of characters, words, 
unique words, or phrases in a 
sentence (researcher computer 
coded). 


Word rareness 


Syntax (within-sentence) Sentence length 


Grammar Link type, a linguistic convention 
that ties a word in a sentence 
to another word within the 
sentence. Differentiates 
between long sentences with 
many different syntactic 
relationships and long 
sentences with few syntactic 
relationships (Temperley et al., 
2012; Sleator & Temperly, 
1991).* 

Discourse (across sentences) Family 1: Intersentential 

complexity 
Linear edit distance The degree of word, phrase, and 
letter pattern repetition across 
adjacent sentences. The 
number of single character 
replacements required to turn 
one sentence into the next one 


(Levenshtein, 1965). 


Degree to which unique words in 
a first sentence are repeated in 
a following sentence, 
comparing sentence pairs 
sequentially (researcher 
computer coded). 

Words that indicate occurrence of 
cohesion in text. Five 
categories of cohesive devises 
between words in text work to 
hold a text together (cf. 
Halliday & Hasan, 1976; 
researcher devised beginning 
with words listed at Cohesion 
[linguistics], n.d.). 


Linear word overlap 


Cohesion triggers 


Family 2: Lexical/syntactic 
diversity 


Possible score range for 
examples 


1 to 25 in our study 
(lower means more of 
the words are known by 
younger readers and a 
higher score means 
fewer are known by 
younger readers) 

0 (less abstract, less 
complex) to 700 (more 
abstract, more complex) 


0.10 (less rare, less 
complex) to 6 (more 
rare, more complex) 


1 (fewer characters, 
words, unique words, or 
phrases) and above 1 
(more characters, 
words, unique words, or 
phrases) 

1 (fewer unique syntactic 
relationships, e.g., 
subject/object or noun- 
acting-as-adjective) to 
29 (more unique 
syntactic relationships 
within sentences; a 
larger number can occur 
when the text has one 
or more very long 
sentences) 


O (if all sentences are 
identical or there is 
only one sentence; lots 
of redundancy, less 
complex) to 
approximately 110 in 
our study (not much 
redundancy, more 
complex) 

0 (no words are repeated 
in a following sentence) 
to 24.56 in our study 
(many words are 
repeated in a following 
sentence) 

0 (no words on the 
cohesion trigger word 
list) to 39 in our study 
(many words on the 
cohesion trigger word 
list) 


Operationalization 
example (number of 
operationalizations) 


Age of acquisition types as 
test with stop list 50 
most frequent words 
(ability at 50%) (13) 


Degree of abstractness 
types as test with stop 
list 50 most frequent 
words (ability at 50%) 
(20) 


Word rareness types as test 
(ability at 90%) (14) 


Mean number of letters 
and spaces in sentences 


(6) 


Mean number of unique 
link types in sentences 


(1) 


Mean linear edit distance 


(4) 


Mean linear word overlap 
with slice 125 (6) 


Percentage of words in text 
that are on the cohesion 
trigger word list (1) 


(table continues) 
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Table 1 (continued) 





Linguistic level Text characteristic Definition (source) 


Possible score range for 
examples 


Operationalization 
example (number of 
operationalizations) 


eee eS 5556666066066 


Type-token ratio 
the number of unique words in 
a text divided by the total 
number of words in a text (cf. 
Malvern, Richards, Chipere, & 
Durn, 2009). 

Family 3: Phrase diversity 
Longest common string Degree of word, phrase, and 
letter pattern repetition across 
multiple sentences. Captures 
couplets and triplets (Gusfield, 
1997). 

Number of single character 
additions, deletions, or 
replacements required to turn 
one string (or sentence) into 
another (Kruskal, 1999; 
Levenshtein, 1965). 

Degree to which unique words in 
a first sentence are repeated in 
a following sentence 
comparing all possible pairs in 
a 125 slice (researcher 
computer coded). 


Edit distance 


Cartesian word overlap 


Family 4: Text density 
Information load Total information load in text. 
Denser texts have more 
information load, less 
redundancy, and are more 
complex. Also taps overlap of 
groups of co-occurring word 
repetition (researcher devised 
incorporating latent semantic 
analysis; Deerwester, Dumais, 
Furnas, Landauer, Harshman, 
1990; Landauer & Dumais, 
1997). 

Family 5: Noncompressibility 
Compression ratio The degree to which information 
in the text can be compressed. 
Novel text is less compressible 
(Burrows & Wheeler, 1994). 


An indicator of word diversity, or 


0 (few unique words) to 1 
(all words are unique) 


0 (a lot of overlap, a lot of 
redundancy, less 
complex) to 1 (not 
much overlap, more 
complex) 

O (the same characters are 
repeated, high 
redundancy) to 127 in 
our study (very few 
characters are repeated, 
low redundancy) 

4 (unique words not 
repeated much in a 
following sentence) to 6 
(unique words repeated 
more) 


0 (low density, low 
information load, lots of 
novel co-occurring 
word-group repetition) 
to 1 (denser text, higher 
information load, not as 
much novel co- 
occurring word-group 
repetition) 


0 (more compressible, 
more redundancy, less 
complex) to 1 (less 
compressible) 


* Definitions of all link types can be found at http://www.link.cs.cmu.edu/link/dict/summarize-links.html 


Type-token ratio with 
chunk 125 (2) 


Mean Cartesian longest 
common string 
percentage with slice 
125 (21) 


Mean Cartesian edit 
distance with slice 125 


(8) 


Percentage of mean 
Cartesian word overlap 
with slice 125 for part 
of speech (4) 


Normalized percentage 
reduction of information 
load across sentences for 
10 dimensions with slice 
500 (12) 


Compression ratio with 
chunk 125 (2) 


emphasis or a syntactic (part of speech) emphasis. Second, 
whether an operationalization employed lexical or syntactic em- 
phasis, operationalizations could also involve linear activity, that is 
adjacent sentences, or they could involve a Cartesian product over 
sentences (i.e., context beyond adjacent sentences), or they could 
address both types of activity. As an example, for the text char- 
acteristic, linear edit distance, the lexical-emphasis operationaliza- 
tion uses the words in two adjacent sentences, whereas a 
syntactical-emphasis operationalization uses parts of speech for 
replacement judgments. (Further detail is provided in the online 
supplemental materials.) 
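
The distinction between linear and Cartesian activity, and between lexical and syntactic emphasis, can be illustrated with a minimal sketch. It is not the study's code: the function names, the token-level (rather than character-level) distance, the use of unordered sentence pairs for the Cartesian product, and the hand-assigned part-of-speech tags are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # replace a token
        prev = curr
    return prev[-1]


def mean_edit_distance(sentences, cartesian=False):
    """Average distance over adjacent pairs (linear) or all pairs (Cartesian)."""
    if cartesian:
        pairs = list(combinations(sentences, 2))
    else:
        pairs = list(zip(sentences, sentences[1:]))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-sentence text; part-of-speech tags are assigned by hand.
words = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "ran"]]
tags = [["DET", "NOUN", "VERB"]] * 3

lexical_linear = mean_edit_distance(words)                     # (1 + 2) / 2
lexical_cartesian = mean_edit_distance(words, cartesian=True)  # (1 + 3 + 2) / 3
syntactic_linear = mean_edit_distance(tags)                    # same tag pattern
```

In this toy text the sentences differ lexically but share one syntactic pattern, so the lexical-emphasis values are positive while the syntactic-emphasis value is zero.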

A third logical approach was to use existing databases and 
resources where possible to create variable operationalizations. 
The following databases were used. The MRC Psycholinguistic 
Database (Coltheart, 1981) “. . . is a machine usable dictionary 
containing 150,837 words with up to 26 linguistic and psycho- 
linguistic attributes for each . . .” (MRC Psycholinguistic Da- 
tabase, n.d., para. 1). Number of phonemes in words, number of 


syllables in words, and indices of word abstractness were ex- 
tracted from the MRC Psycholinguistic Database. The Carnegie 
Mellon University Pronouncing Dictionary (Carnegie Mellon 
University, n.d., “About the CMU dictionary,” para. 1) “. . . is 
a machine-readable pronunciation dictionary for North Ameri- 
can English that contains over 125,000 words and their tran- 
scriptions.” It was used for variable operationalizations of the 
text characteristic, mean internal phonemic predictability. The 
Kuperman, Stadthagen-Gonzalez, and Brysbaert (2012) age-of-acquisition
ratings for 30,000 English words were used for
operationalizations of the age-of-acquisition text characteristic. 
The rating indicates the age at which a word’s meaning is first 
known. Word frequencies for running text in a corpus of 1.39 
billion words from 93,000 kindergarten through university texts 
(MetaMetrics, n.d.-b) normalized to link to Carroll, Davies, and 
Richman (1971) word frequencies, were used to create opera- 
tionalizations for word rareness. The Link Grammar Parser 
(Sleator & Temperley, 1991; Temperley, Sleator, & Lafferty, 
2012) was used for operationalizations of Grammar. The Parser 
“. . . is a syntactic parser of English, based on link grammar, an 
original theory of English syntax. Given a sentence, the system 
assigns to it a syntactic structure, which consists of a set of 
labeled links connecting pairs of words” (Temperley et al., 
2012, para. 1). 

Additional existing resources were as follows. The Menon and 
Hiebert (1999) decodability scale was slightly modified for opera- 
tionalizations of the text characteristic, decoding demand. The 
scale provides numeric values for varying degrees of within-word 
structural complexity. The Dolch word list (n.d.) and the first 660 
words on the Fry Word List (2012) were used in operationalizations 
of the text characteristic, sight words. 

A fourth logical approach was to use techniques to control for 
factors that might be considered irrelevant to the measurement of 
specific text characteristics. One technique used for some opera- 
tionalizations of sounds-in-words and word-level text characteris- 
tics was stop listing (Luhn, 1958), which is commonly used in 
natural language processing computations. Stop listing means de- 
letion of the highest frequency words that tend to have low 
semantic value. However, because it is not known in advance 
whether deleting highly frequent words matters for examining text 
complexity, when stop listing was used for selected text charac- 
teristic operationalizations, the same text characteristics were also 
operationalized without stop listing. 
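
As an illustration of stop listing, the sketch below computes one simple word-level value with and without a stop list. It is a hedged example only: the helper names, the toy corpus, the use of mean word length as the metric, and the choice to return zero when every word is on the stop list are assumptions, not the study's implementation.

```python
from collections import Counter


def build_stop_list(corpus_tokens, n):
    """Return the n most frequent tokens in a reference corpus."""
    return {word for word, _ in Counter(corpus_tokens).most_common(n)}


def remove_stop_words(tokens, stop_list):
    """Delete every token that appears on the stop list."""
    return [t for t in tokens if t not in stop_list]


def mean_word_length(tokens):
    """Toy word-level metric; returns 0.0 when no tokens remain."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


# Hypothetical text; a real stop list would come from a large corpus.
tokens = "the cat and the dog and the bird".split()
stop_list = build_stop_list(tokens, n=2)          # {"the", "and"}

with_stop_words = mean_word_length(tokens)
without_stop_words = mean_word_length(remove_stop_words(tokens, stop_list))
```

Computing both values, as the study did for selected operationalizations, leaves it to the later modeling to show whether deleting high-frequency words matters.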

Another technique was aimed at addressing possible impact of 
text length on a text-characteristic value. In general, longer dis- 
course units can be related to increased complexity in part because 
inclusion of more material offers more opportunity for additional 
text characteristics or higher levels of individual text characteris- 
tics but also because each addition in a longer progression of 
discourse may require additional cognitive integration on the part 
of the reader (Merlini Barbaresi, 2003). Many text-characteristic 
operationalizations employed length control by using “slices” or 
“chunks” of text. When slices/chunks were employed, multiple 
slices/chunks were obtained from a text, covering the entire text, 
and then the final metrics were averaged over slices/chunks. 
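
The length-control idea can be sketched as follows, using type-token ratio and a compression ratio as the per-chunk metrics. This is an assumption-laden illustration: the function names are hypothetical, the final shorter chunk is simply kept, and `zlib` stands in for whatever compressor the study used (a general-purpose compressor can also yield ratios above 1 on very short inputs).

```python
import zlib


def chunks(tokens, size):
    """Split a token list into consecutive chunks covering the whole text."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens)


def compression_ratio(tokens):
    """Compressed size divided by raw size; novel text compresses less."""
    raw = " ".join(tokens).encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


def mean_over_chunks(tokens, size, metric):
    """Compute the metric on each chunk, then average over chunks."""
    parts = chunks(tokens, size)
    return sum(metric(part) for part in parts) / len(parts)


# Toy example with chunk size 4 (the study reports sizes such as 125).
tokens = ["a", "b", "a", "b", "a", "b", "c", "d"]
whole_text_ttr = type_token_ratio(tokens)                 # 4 types / 8 tokens
chunked_ttr = mean_over_chunks(tokens, 4, type_token_ratio)  # (0.5 + 1.0) / 2
```

The whole-text value falls as a text grows longer because words recur, whereas the chunked value is computed on units of equal length, which is the motivation for slicing.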


Analyses 


Analyses were accomplished using a machine-learning logical 
analytical progression (Mohri et al., 2012). Random forest regres- 
sion was used for statistical modeling. The analyses performed for 
the present study are among the first to appear in the educational 
research literature and therefore deserve some added attention and 
description here. 

The statistical modeling approach. The interdisciplinary 
team of researchers who accomplished the present study worked 
from a statistical modeling approach that is not commonly used in 
educational research, but it is an approach that holds promise for 
some kinds of educational problems (Strobl, Malley, & Tutz, 
2009). Two cultures of statistical modeling derive from diverse 
epistemological terrains in which different ways of knowing un- 
dergird different paradigms and procedures (Breiman, 2001b). A 
classical statistical modeling paradigm in educational research 
progresses in a top-down fashion. A theory is created detailing 
which constructs hypothetically matter in relation to some out- 
come(s) and how the constructs are related to one another. Con- 
sideration is given to how the constructs can be measured, a 


relatively small set of “predictors” is selected, and the relationships 
are examined. Often a few interactions among predictors are 
hypothesized and represented in the statistical model. The resulting 
model is tested statistically through fit of the data to the originating 
model. 

In another statistical culture, the one used in the present re- 
search, the counterculture to the predominant educational statisti- 
cal paradigm, although theory can be involved initially (and was in 
our work), modeling works in a bottom-up fashion—starting with 
data (Breiman, 2001b). In recent years, multivariate data exploration 
methods have become increasingly popular in many scientific 
fields, including health sciences, biology, biostatistics, medicine, 
epidemiology, genetics, and most recently, psychology, and 
in machine-learning communities (Grömping, 2009; Strobl et al., 
2009). “Machine learning” references construction, exploration, 
and study of algorithms and models that are “learned” or “trained” 
from data (Mitchell, 1997). Large amounts of data are processed, 
patterns are discovered, and predictor models are built. While 
some theoretical background is certainly helpful in discerning key 
constructs involved in a particular problem, there is no limit on the 
number of variables. Rather, all variables that can be imagined and 
measured are included as potential predictors. Sometimes, depend- 
ing on modeling choice, any and all possible interactions among 
variables can be accounted for. The result is a model of the 
important predictors (and interactions) associated with the out- 
come. The “goodness” of the model is tested through its predictive 
capacity using a previously “unseen” set of data. 

Random forest regression. The statistical modeling tech- 
nique used in the present research was random forest regres- 
sion—a nonparametric statistical analysis that involves an ensem- 
ble (or set) of regression trees (often referred to as CART— 
Classification and Regression Tree; Breiman, 2001a; Breiman, 
Friedman, Olshen, & Stone, 1984). Random forest regression 
overcomes limitations of a single regression tree and linear regres- 
sion for particular circumstances such as when large numbers of 
variables are involved (Hastie, Tibshirani, & Friedman, 2009; 
Strobl et al., 2009). It is called an ensemble procedure because 
predictions from many decision trees are aggregated to produce a 
single prediction. Decision tree regression is based on the principle 
of recursive partitioning, where the feature space (defined by the 
predictor variable operationalizations) is recursively split into re- 
gions containing observations (in our case, texts) with similar 
response values. The predicted value for a text in a region is the 
mean of the response values for all texts in that region. For 
example, in our study, the many regressions produce regions or 
classes where texts have similar text characteristics in relation to 
their text-complexity levels. (For a detailed explanation of recur- 
sive partitioning, see Strobl et al., 2009.) The procedure is called 
random forest because each individual decision tree is “trained” 
using a different random bootstrap sample of the texts and because 
each split within each tree is created using a random subset of 
candidate variables (Grömping, 2009). (Bootstrapping is a process 
of repeated resampling of the data, with each sample randomly 
obtained with replacement from the original data set.) Ultimately, 
from the forest (ensemble) of trees, a single prediction can be made 
by calculating a mean of predictions output by the individual trees 
(Grömping, 2009). 
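
The mechanics described above (recursive partitioning, bootstrap samples, random candidate subsets at each split, and averaging over trees) can be sketched in a few lines of plain Python. This is a deliberately small teaching sketch with hypothetical names and toy data, not the scikit-learn implementation used in the study.

```python
import random


def avg(values):
    return sum(values) / len(values)


def best_split(rows, targets, features):
    """Return (squared error, feature, threshold) of the best split, or None."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= thr]
            right = [t for row, t in zip(rows, targets) if row[f] > thr]
            if not left or not right:
                continue
            m_left, m_right = avg(left), avg(right)
            sse = (sum((t - m_left) ** 2 for t in left)
                   + sum((t - m_right) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    return best


def grow_tree(rows, targets, mtry, depth, rng):
    """Recursive partitioning; a leaf stores the mean response of its texts."""
    if depth == 0 or len(rows) < 2:
        return avg(targets)
    features = rng.sample(range(len(rows[0])), mtry)  # random candidate subset
    split = best_split(rows, targets, features)
    if split is None:
        return avg(targets)
    _, f, thr = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return (f, thr,
            grow_tree([r for r, _ in left], [t for _, t in left], mtry, depth - 1, rng),
            grow_tree([r for r, _ in right], [t for _, t in right], mtry, depth - 1, rng))


def predict_tree(tree, row):
    """Follow splits until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


def train_forest(rows, targets, n_trees, mtry, depth, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [targets[i] for i in idx], mtry, depth, rng))
    return forest


def predict_forest(forest, row):
    """The ensemble prediction is the mean of the individual tree predictions."""
    return avg([predict_tree(tree, row) for tree in forest])


# Toy data: feature 0 drives the outcome, feature 1 is noise.
data_rng = random.Random(42)
rows = [[i / 20, data_rng.random()] for i in range(20)]
targets = [100 * row[0] for row in rows]
forest = train_forest(rows, targets, n_trees=25, mtry=1, depth=3)
low = predict_forest(forest, [0.1, 0.5])
high = predict_forest(forest, [0.9, 0.5])
```

Here `mtry` plays the role described later in the Results: it is the number of predictors made available for selection at each split.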

Essentially, using the available data (in our case, the text- 
complexity level as outcome and 238 variable operationalizations 
for each text as predictors), random forest regression builds a final 
model “from the ground up” by aggregating over many individu- 
ally “trained” models. (To better understand random forest regres- 
sion, and partly to better understand why it is potentially beneficial 
for analyzing text complexity, comparison to linear regression can 
be informative. A detailed comparison is provided in the online 
supplemental materials.) 

Steps in analyses. Initially, an automated computer analysis 
was conducted for the 350 digitized texts and the 89 passages that 
students read, resulting in values for each text and passage for 
text-complexity level and for the 238 text-characteristic variable 
operationalizations. Then, four analytical phases were accom- 
plished. (a) The first step in analysis was to set baseline perfor- 
mance. Eighty percent of the texts were randomly selected, and a 
three-pronged training phase was conducted using random forest 
regression. Three random forest regressions were conducted for: 
the 80% of the 350 texts that teachers ordered (n = 279; one text 
was discarded due to poor digitization), the 80% of the 89 student 
passages (n = 71), and the two sets of texts combined (n = 350). 
Each of the three random forest regressions yielded importance 
values for each of the 238 variables in relation to the text- 
complexity outcome variable. Model prediction capacity (correla- 
tion) and prediction error were calculated for each of the three 
models on “out-of-bag” samples (Grömping, 2009). (b) To determine 
whether a more parsimonious set of variables could predict 
text complexity as well as, or nearly as well as, the 238 variables, 
a two-stage iterative variable-selection procedure was used 
(Grömping, 2009). First, for each of the three models, the least 
important variable was removed from the model, random forest 
regression was rerun, and prediction error was recalculated. The 
process was repeated until model prediction error began to in- 
crease, resulting in a moderately sized set of predictors for each of 
the three models. Then the union of predictors in the three models 
was selected, creating a moderately sized set of predictors. Second, 
in a next round of variable elimination, redundant operationalizations 
of text characteristics in the moderately sized set were 
identified, and the least important of the correlated redundant 
variables were trimmed out, using a correlation-strength cut-point 
chosen so that model prediction capacity was maintained. 
(c) In a validation phase, the predictive capacity 
for the trimmed model was investigated, using texts not 
employed for the variable selection and “training” phases—a 20% 
hold-out set of texts. (d) Follow-up analyses were done to explore 
the data structure. 
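
The first stage of the variable-selection procedure in step (b) can be sketched as a generic loop. The sketch is hedged: `fit_error` and `importance` are hypothetical callbacks that, in a real analysis, would refit a random forest and return its out-of-bag error and importance values; here they are replaced by toy functions so the loop can run on its own.

```python
def backward_eliminate(variables, fit_error, importance):
    """Drop the least important variable until prediction error rises."""
    current = list(variables)
    best_error = fit_error(current)
    while len(current) > 1:
        scores = importance(current)
        weakest = min(current, key=lambda v: scores[v])
        candidate = [v for v in current if v != weakest]
        error = fit_error(candidate)
        if error > best_error:   # error began to increase: stop trimming
            break
        current, best_error = candidate, error
    return current, best_error


# Toy stand-ins (assumptions): two informative variables and two noise ones.
signal = {"edit_distance": 0.5, "age_of_acquisition": 0.3,
          "noise_a": 0.0, "noise_b": 0.0}


def toy_fit_error(variables):
    # Error falls with included signal and rises slightly per extra variable.
    return 10 - 10 * sum(signal[v] for v in variables) + 0.1 * len(variables)


def toy_importance(variables):
    return {v: signal[v] for v in variables}


kept, error = backward_eliminate(list(signal), toy_fit_error, toy_importance)
```

In the toy run the two noise variables are removed first, and the loop stops when removing an informative variable would increase error.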


Results 


Preliminary Random Forest Regression Decisions 


The following decisions were made for conducting the random 
forest regressions using scikit-learn (Pedregosa et al., 2011): (a) At 
each node, the computer selected just one variable to make a split. 
(b) A constant predictor split point was used in each leaf. (c) Mean 
square error was used as the splitting objective to optimize in each 
node. (d) Randomness was injected into the trees using “bagging,” 
a method that allows all variables to be available for selection at a 
given node. During the training phase, “mtry” (the number of 
predictors available for selection) was set at 238. During the 
validation phase, “mtry” was set at three (or the square root of “p” 


where “p” was nine predictors). The larger “mtry” was used when 
there was a moderate or large number of correlated predictors, 
because in the case of many predictors more power is concentrated 
in a relatively small subset of predictors. For variable selection, 
concentration of power is desirable, and as well, large mtry results 
in more stable variable selection because the most powerful vari- 
ables tend to emerge repeatedly. (e) For variable selection, each 
random forest model was conducted with 100 trees. In the valida- 
tion phase, random forest regressions were conducted with 500 
trees. (f) The importance values were normalized random- 
permutation-based. (g) During training, out-of-bag model error 
(root-mean-square error [RMSE], for which error is normalized 
relative to the number of texts) was calculated as an estimate of 
generalizability error (Breiman, 2001a). During the validation 
phase, non-out-of-bag RMSE was calculated (Breiman, 2001a). 
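
Decisions (f) and (g) can be illustrated with a small sketch of RMSE and a normalized permutation-based importance. It is an approximation under stated assumptions: the function names are hypothetical, importance is taken as the increase in RMSE after shuffling one predictor, negative increases are clipped to zero, and values are normalized to sum to 1.

```python
import math
import random


def rmse(predicted, observed):
    """Root-mean-square error, averaged over the number of texts."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(observed))


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in RMSE when one predictor is shuffled, normalized to sum to 1."""
    rng = random.Random(seed)
    baseline = rmse([predict(r) for r in rows], targets)
    raw = []
    for f in range(n_features):
        column = [r[f] for r in rows]
        rng.shuffle(column)  # break the link between this predictor and outcome
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        increase = rmse([predict(r) for r in shuffled], targets) - baseline
        raw.append(max(0.0, increase))
    total = sum(raw)
    return [x / total if total else 0.0 for x in raw]


# Toy model (an assumption): only the first predictor affects the prediction.
rows = [[float(i), float(i % 3)] for i in range(12)]
targets = [3 * r[0] for r in rows]
importances = permutation_importance(lambda r: 3 * r[0], rows, targets, 2)
```

Shuffling a predictor that the model ignores leaves the error unchanged, so its importance is zero, while the predictor that carries the signal receives all of the normalized importance.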


Phase 1. Training Phase Results: Baseline 
Model Performance 


For the model using the 279 texts that teachers ordered and all 
238 text-characteristic operationalizations, the mean correlation of 
text complexity as predicted from the model with the empirical 
text-complexity measures from 10 analytical runs of 100 trees each 
was .89, and the model error (RMSE) was 8.66. For the model 
using the 71 passages that students read and the 238 text- 
characteristic operationalizations, the mean correlation was .69, 
and the RMSE was 10.58. For the model combining the two sets 
of texts (n = 350), the mean correlation was .87, and the RMSE 
was 8.72. For each of the three models, predictive power was high, 
and error was low. (Importance values were computed for all 238 
predictor variables in each of the three models, but given the large 
number of variables, only the final model variable importance 
values are reported in a following section.) 


Phase 2. Trimmed Model and Final 
Operationalization Descriptives 


First, Figure 1 shows that as the least important operationalizations 
were dropped from the model, one by one, model correlation 
(that is, the predictive capacity) began to drop visibly for the 
teacher and combined models when approximately 25 variable 
operationalizations were left in the model. For the student 
model, it dropped with approximately 10 
variables remaining. The union of the top 25 operationalizations in 
each of the three models was then selected, resulting in 45 pre- 
dictor operationalizations. Then one model was created for the 
next step using the 45 predictor variable operationalizations. 

Second, the first trim included redundant variable operational- 
izations for single text characteristics. To eliminate redundancies 
that were highly correlated, the intercorrelations of all 45 predic- 
tors were computed, using the combined data set. Then in the top 
of Figure 2 potential correlational thresholds are shown on the 
x-axis, and the y-axis shows what the model correlation would be 
if redundant variable operationalizations were removed using dif- 
ferent magnitudes of threshold correlation as cut-points to delete 
redundant variables. Through visual inspection of the top graph, 
.70 was chosen as the correlational cut-point because it appeared 
that doing so would result in only very slight model correlation 
drop while removing a significant number of redundant predictors. 

Figure 1. Correlation of predicted with empirical text-complexity in relation to least important variable 
deletion from each of three models. The top line represents correlational changes for teacher judgment, the 
middle line represents correlational changes for the combined teacher and student text-complexity assignments, 
and the bottom line represents correlational changes for the student text-complexity assignments. Also, out-of-bag 
correlation is used. At each point on the x-axis there is a central point that is the mean of the out-of-bag 
correlations from 10 independent random-forest runs. Surrounding the central point are error bars that represent 
the standard deviations from the 10 runs. 


Then, as shown in the bottom graph in Figure 2, using the 
threshold cut-point of a .70 correlation, 11 variable operational- 
izations remained in the model. Among the 11, two sets of opera- 
tionalizations were highly similar, and in each case, the least 
important of the two was dropped. In sum, the model trimming 
procedure resulted in a nine-predictor model, with the predictors 
noted here in parentheses by linguistic level: for word structure 
(decoding demand, number of syllables in words), for word mean- 
ing (age of acquisition, abstractness, and word rareness), and for 
sentence and discourse level (intersentential complexity, phrase 
diversity, text density/information load, and noncompressibility). 
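
The redundancy trim can be sketched as a greedy filter on pairwise correlations. The exact rule used in the study is not spelled out here, so the rule below is an assumption: predictors are visited from most to least important, and a predictor is kept only if its absolute correlation with every already-kept predictor is below the cut-point. Variable names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def prune_redundant(columns, importance, threshold=0.70):
    """Keep the more important member of each highly correlated pair."""
    kept = []
    for name in sorted(columns, key=importance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values for five texts on three operationalizations.
columns = {
    "edit_distance_slice_125": [1, 2, 3, 4, 5],
    "edit_distance_slice_250": [2, 4, 6, 8, 11],   # nearly a copy of the first
    "age_of_acquisition":      [5, 1, 4, 2, 3],
}
importance = {"edit_distance_slice_125": 0.5,
              "edit_distance_slice_250": 0.3,
              "age_of_acquisition": 0.2}

kept = prune_redundant(columns, importance, threshold=0.70)
```

With these toy values the second edit-distance variant is dropped because it correlates at about .99 with the more important first variant.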

After variable selection, a final set of three random forest 
regression models was trained using only the nine variables 
(mtry = 3) with the teacher text-complexity assignments, the 
student assignments, and the two combined together. The resulting 
correlations (and RMSEs) for the teacher, student, and combined 
models were .89 (8.40), .71 (10.35), and .88 (8.59), respectively. 


Phase 3: Model Validation 


To validate the model, the hold-out set of 20% of books (n = 
71) and 20% of the passages for student reading (n = 19) was 
combined. A final random forest regression (mtry = 3) was run 
with the nine selected variables as predictors and the empirical 
text-complexity variable from the combined (teacher and student) 
data as the outcome. The model was validated with a correlation of 
.85 and RMSE of 9.68. Figure 3 shows the generally tight rela- 
tionship among the nine predictors and text complexity level. 
Variance explained by the model was 71.98%. Of note, the vali- 


dation model error was similar to the combined data set model 
during training (8.72), suggesting minimal, if any, model overfit. 


Variable Importance Values, Descriptives (Including 
Text Complexity), and Intercorrelations 


Finally, after the validation phase, mean importance values were 
obtained from 10 final random forest regressions with 500 trees 
and mtry set at 3, using the 350 texts (Grömping, 2009). The 
variable importance values, mean, standard deviation, and range 
for the final nine variables along with mean, standard deviation, 
and range for the text-complexity variable, are shown in Table 2. 
The order of text-characteristic importance was: intersentential 
complexity (the linear edit distance operationalization; most im- 
portant), text density/information load, phrase diversity (the lon- 
gest common string operationalization), age of acquisition, number 
of syllables in words, abstractness, decoding demand, noncom- 
pressibility, and word rareness. Notably, three discourse-level 
characteristics appeared near the top of the importance order 
suggesting relative strength of discourse-level characteristics for 
predicting text complexity. Also included were word-structure and 
word-meaning text characteristics. While no variable that repre- 
sented within-sentence text characteristic alone emerged, the 
discourse-level variables indirectly included facets of within- 
sentence characteristics— because to create measures across sen- 
tences, within-sentence characteristics had to be taken into ac- 
count. 

The text-characteristic variable operationalization means for the 
word structure variables (decoding demand and number of syllables 
in a word) suggested that across the entire set of texts word 
structure was moderately challenging, though the range for decod- 
ing demand was wide—up to 7.91 (out of 9). (See Table 2 for 
summary statistics.) The means for the word meaning variable 
operationalizations (age of acquisition, abstractness, and word 
rareness) again suggested that on the whole, the abstractness of the 
words in the text was moderate (approximately at the middle of the 
possible range of scores), but as would be expected, word rareness 
was minimal and age of acquisition tended to be low, though 
again, for all three variables, the standard deviations suggested a 
wide range of values. Means for the discourse level variable 
operationalizations suggested that the text corpus involved a fair 
amount of repetition, redundancy, and patterning in that means for 
three of the variables ranged from .55 (for noncompressibility, a 
compression ratio that could range from 0 to 1) to .80 (for phrase 
diversity, longest common string that could range from 0 to 1), 
with intersentential complexity reflecting such features more modestly. 
In all four cases, nearly the complete range of values was 
represented in the corpus, suggesting a fair amount of variability 
on the discourse-level text characteristics. Finally, the full range of 
text-complexity values was witnessed, with a mean of 50.10. 

Figure 3. Scatterplot depicting the final model during validation. 

Table 2 
Importance Values for the Nine Text-Characteristics Variables and Descriptives for Text Characteristics and Text Complexity 

Variable                                M importance value (SD)   M (SD)            Range           Variable operationalization
Text complexity                                                   50.10 (18.85)     0.33-100.00
Text characteristics
  Word structure
    Decoding demand (7)                 .0164 (.0017)             5.32 (0.97)       2.00-7.91       Mean with stop list 50 most frequent words
    Number of syllables in words (5)    .0633 (.0038)             1.42 (0.24)       0.00*-2.42      Types as test with stop list 50 most frequent (ability at 75%)
  Word meaning
    Age of acquisition (4)              .0917 (.0073)             3.67 (0.52)       2.41-5.26       Types as test with stop list 50 most frequent words (ability at 50%)
    Abstractness (6)                    .0557 (.0040)             384.35 (63.11)    199.80-700.00   Types as test with stop list 50 most frequent words (ability at 50%)
    Word rareness (9)                   .0064 (.0004)             1.29 (0.29)       0.54-2.23       Types as test (ability at 90%)
  Discourse level
    Intersentential complexity (1)      .3487 (.0125)             31.04 (17.37)     0.00-109.88     Mean linear edit distance
    Phrase diversity (3)                .1782 (.0090)             0.80 (0.13)       0.31-1.00       Mean Cartesian longest common string percentage with slice 125
    Text density: Information load (2)  .2313 (.0116)             0.76 (0.10)       0.22-0.89       Normalized percent reduction of information load across sentences, 10 dimensions with slice 500
    Noncompressibility (8)              .0084 (.0006)             0.55 (0.11)       0.25-1.00       Compression ratio with chunk 125

Note. Permutation accuracy importance values were used following Strobl, Malley, and Tutz (2009). Numbers in parentheses in the first column indicate 
rank order for importance value of descriptives for 350 texts, with values ranging from 1 (most important) to 9 (least important). 
* Zero scores occur when all the words in the text are on the stop list. 

The correlations in Figure 4 indicate moderately positive relationships 
of all nine variable operationalizations with text complexity, 
ranging from .35 to .73 with the exception of noncompressibility 
(.18, though significant). Next, variable 
operationalizations within word structure, within word meaning 
(see the left-most triangle in Figure 4), and within discourse level 
(see the right-most triangle in Figure 4) were, on the whole, 
moderately correlated with each other, though in each of the three 
groups, there were one or two low correlations, suggesting that 
within-linguistic-level variable operationalizations tended to capture 
similar text characteristics. Also, on the whole, the cross-group 
correlations tended to be somewhat lower than within-group 
correlations, suggesting to some degree that each group of 
variables was measuring a unique set of characteristics (see the 
boxes in Figure 4). That is, decoding demand 
and number of syllables in words correlated with the three 
word-meaning variable operationalizations from .06 to .54, all lower 
than .66, the correlation of decoding demand with number of 
syllables in words. The top right-most box shows a similar 
pattern. For the comparison of the word-meaning variable 
within-group correlations (the left-most triangle in Figure 4) 
versus the cross-group correlations of word meaning with dis- 
course level variable operationalizations (the bottom box in 
Figure 4) again, on the whole, the within-group word-meaning 
correlations (.34 to .57), not including the low correlation of 
abstractness with word rareness (.05), tended to be similar to, or 
higher than, the cross-group comparison to the discourse-level 


correlations (with the exception of the correlation of age of 
acquisition with intersentential complexity, .12 to .53). 


Exploring the Data Structure and the Text- 
Characteristic Interplay 


Several follow-up analyses (using all 350 texts and the teacher- 
based empirical text-complexity levels) were done to explore the 
data structure, the degree of text-characteristic variability in high 
versus low text-complexity levels, the interplay of text character- 
istics in relation to text complexity levels (decision trees and 
quintiles), and the interplay of text characteristics in relation to 
genre. The analyses were conducted using visualization method- 
ology from CARTscans (a graphical tool that displays predicted 
values across multidimensional subspaces; Nason, Emerson, & 
LeBlanc, 2004), along with additional visualization techniques 
recommended by Cook and Swayne (2008) and by Cohen, Cohen, 
Aiken, and West (2003). A strong theme permeated findings—the 
interplay of text characteristics was an important factor for explaining 
text complexity. 

The general structure of text characteristics in relation to 
text complexity. In a traditional approach, principal components 
analysis or factor analysis might be used to describe the data 
structure, but those techniques assume a linear relationship among 
variables. We hypothesized nonlinearity and used an unsupervised, 
nonlinear dimension-reduction technique—modified locally linear 
embedding analysis (Zhang & Wang, 2007). The technique ac- 
counts for the intrinsic geometric properties of each neighborhood 
of texts that share text-characteristic profiles. Essentially, in the 
analysis, the nine text characteristic operationalizations were re- 
expressed in a three-dimensional space by finding local planes of 
best fit for the neighborhood around each text (set at 15 neighbors; 
Vanderplas & Connolly, 2009) and then stitching them together to 
describe the entire 350-text space. The planes of best fit need not 
share the same parameters across neighborhoods. Once the 
dimension-reduced text space was constructed, the text-complexity 
levels were noted in colors: warmer colors represent higher 
text-complexity levels, and cooler colors represent lower 
text-complexity levels. The result is shown in Figure 5. The three 
locally linear dimensions are not in themselves interpretable. Each 
is associated to varying degrees with the nine text characteristics. 
All 350 texts are represented as dots in the space. The main 
conclusion of the visual analysis was that there was a clear thread 
of text-characteristic relationships with each other and with text 
complexity that moved through the space, a thread that suggested 
an essentially unidimensional construct in measurement terms, but 
the text-characteristic relationships with text complexity were not 
globally linear. Instead, text-characteristic relationships 
interplayed differently in different local neighborhoods. 

Figure 4. Correlations among the final nine text characteristics and text complexity. The top left-most box 
indicates the cross-group correlations for word-structure variable operationalization with the word-meaning 
variable operationalizations. The top right-most box indicates the cross-group correlations for word-structure 
variable operationalization with discourse-level variable operationalizations. The left-most triangle indicates 
within-group correlations for word-meaning variable operationalizations. The bottom box indicates the 
cross-group correlations for word-meaning variable operationalizations with discourse-level variable 
operationalizations. The right-most triangle indicates within-group correlations for discourse-level variable operationalizations. 

Degree of text-characteristic variability in high versus low 
text-complexity levels. To examine the extent to which text- 
characteristic variability was different according to text-complexity 
level, the nine text-characteristic variables were standardized as 
z-scores, and texts were split into high and low text-complexity 
groups using the following procedures (outlined in Cohen et al., 2003, 
and Green & Salkind, 2011). Centers for the high and low texts were 
determined at one standard deviation above and below the total 
text-set mean, respectively. Next, bands for high and low texts were 
created at plus and minus half of a standard deviation around the mean 


of the center points, respectively, so as to filter out texts close to the 
mean (Cook & Swayne, 2008). Finally the split plots in Figure 6 were 
generated. 
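
One reading of the high/low split can be sketched as follows. The interpretation is an assumption: text-complexity values are standardized, the low and high centers sit at z = -1 and z = +1, and a text is retained when it lies within half a standard deviation of a center, which filters out texts near the mean. The function names and the use of the population standard deviation are also assumptions.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def split_high_low(complexity, half_width=0.5):
    """Return indices of texts in the low band and in the high band."""
    z = z_scores(complexity)
    low = [i for i, v in enumerate(z) if abs(v + 1.0) <= half_width]
    high = [i for i, v in enumerate(z) if abs(v - 1.0) <= half_width]
    return low, high


# Hypothetical text-complexity levels for nine texts.
complexity = [10, 20, 30, 40, 50, 60, 70, 80, 90]
low_texts, high_texts = split_high_low(complexity)
```

In this toy set the middle texts fall outside both bands, as do the two most extreme texts, so each group contains only texts close to its center.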

A main conclusion was that for most sets of relationships, there 
was more variability in lower text-complexity texts than in high 
ones. For the two word-structure relationships with text- 
complexity level, the decoding-demand levels for the low- 
complexity texts ranged widely, while most decoding-demand 
levels for high-complexity texts were tightly collected around the 
mean. For two of the three word-meaning characteristic operationalizations 
(age of acquisition and word rareness), the variability 
patterns were highly similar for low and high text-complexity 
texts, but for higher complexity, the word-meaning 
values were shifted upward by approximately two standard deviations. 
On the other hand, for three of the four discourse-level 
variables (intersentential complexity, phrase diversity, and text 
density) there was little to no overlap in the two patterns, signaling 
a dramatic shift in the degree of repetition, redundancy, and 
patterning—less of it (higher values) in the higher complexity 
texts. 

Also evident in the split plots are outlier texts. For instance, in 
the low text-complexity group for age of acquisition, there were 
some texts that had relatively high age-of-acquisition values, lead- 
ing to the question of how a book with such high values on that 
text characteristic might receive a low value on text complexity. A 
general pattern appeared from examination of complete profiles of 
text characteristics for some randomly selected “outlier” texts. 

Figure 5. Three-dimensional scatterplot showing the data structure. Color represents text-complexity level, 
with red representing the highest text-complexity level, orange and then yellow the next highest, green and 
lighter blue moving lower, and blue the lowest. Each point is a text. MLLE = modified locally linear embedding. 


Where extreme values were present in low text-complexity texts, 
generally, the high values tended to be compensated by low values 
on other text characteristics. For example, a text’s relatively high 
value on a word structure or word meaning characteristic was 
modulated and supported by a high degree of repetition, enough
to effect a relatively low text-complexity level.

Interplay of text characteristics: Generalized interactions or 
regions of interactions? Two ways to explore the potential for 
text characteristics to function together in relation to text- 
complexity level were visualization of a single regression tree and 
contour plots (Nason et al., 2004). First, we created a single 
regression tree (see Figure 7) using standardized z-score values for 
the predictor variable operationalizations, with the tree grown to 
five levels of depth and restricting nodes to a minimum of 10 texts. 
The goal was to visualize the degree to which text characteristics 
might be conditioned on one another when predicting text com- 
plexity—not to determine which variables interacted with one 
another in the classic statistical sense. While information can be 
gleaned from exploring a single regression tree, generalization to 
early-reader texts at large is cautioned because of the possibility of 
single-tree overfit to a data set (Breiman, 2001a). 
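
A single regression tree of this kind might be grown roughly as follows. The sketch uses scikit-learn and synthetic data, neither of which is named in the study; mapping "a minimum of 10 texts" per node to `min_samples_leaf=10` is an assumption, as are the variable names and the toy outcome with one built-in local interaction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for 350 texts measured on nine text characteristics.
rng = np.random.default_rng(0)
names = [
    "decoding_demand", "syllables", "age_of_acquisition", "abstractness",
    "word_rareness", "intersentential_complexity", "phrase_diversity",
    "text_density", "noncompressibility",
]
X = rng.normal(size=(350, len(names)))

# Standardize each predictor to z-scores.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# Toy outcome: abstractness matters only when intersentential complexity is high.
isc, abstractness = Xz[:, 5], Xz[:, 3]
y = 50 + 10 * isc + np.where(isc > 0, 5 * abstractness, 0.0)
y = y + rng.normal(scale=2.0, size=len(y))

# Depth capped at five levels; at least 10 texts per terminal node (assumed mapping).
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=0)
tree.fit(Xz, y)

# Print only the top of the tree to keep the output short.
print(export_text(tree, feature_names=names, max_depth=2))
```

Reading the printed splits from the root downward shows which characteristics condition one another in each branch, which is the kind of regional interplay the figure is meant to visualize.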

Two main findings from examination of the decision tree 
were that the interplay of text characteristics mattered for text 
complexity and that microinteractions among text characteris- 
tics were regional rather than generally applicable to the whole 
body of text characteristics and text complexity. The tree de- 
picts several localized interactions, or ways that text- 
complexity values may be predicted from combinations of 
certain text characteristics such that the impact of a text char- 
acteristic is conditioned by the value of one or more other text 
characteristics (two are noted in the dotted boxes in Figure 7). 
As an example, the far right side of the regression tree in Figure 
7 depicts a localized asymmetrical interaction. Starting at the 
top of the regression tree in Figure 7, the computer algorithm
made the first split using intersentential complexity as the
predictor that would result in the least error in predicting text 
complexity. To the right are texts that have intersentential 
complexity values higher than −.3045, that is, not much repetition,
redundancy, or patterning. Moving farther to the right to
Node B (which split the high intersentential complexity texts 
into even further subgroups of higher and lower intersentential 
complexity) and then Node C, the 109 texts at Node C have the 
least amount of repetition, redundancy, or patterning of the 350 
texts. At Node C abstractness was selected as the predictor that 
conditioned intersentential complexity so as to achieve the 
smallest error in predicting text complexity. Notice that for 11 
of the 109 texts, the ones with the lowest abstractness values, no 
further predictors were required to arrive at the final text 
complexity value with the smallest error. However, 98 of the 
109 texts that had higher values on abstraction were further 
conditioned by noncompressibility and after that by age of 
acquisition. That is, the effect of abstractness is different for the 
two branches created by intersentential complexity. 

Another interesting subtle finding reflecting the interplay of 
text characteristics that can be visualized from the regression 
tree is that sometimes slightly different combinations of 
text-characteristic conditioning can result in approximately the 
same text-complexity level. Notice for instance among the four 
left-most boxes just above the bottom row in the figure that two 
sets of texts have text complexity levels of 21.60 and 22.57, 
respectively. While both share similarly low intersentential com- 
plexity, for the left-most texts (21.60), conditioning intersentential 
complexity by the presence of higher word rareness values resulted 
in approximately the same text-complexity value as the right-most 
texts (22.57) where intersentential complexity was conditioned by 
lower values on noncompressibility. 

A second way to explore potential interplay among variables 
was to visually examine contour plots (Nason et al., 2004). 

Figure 6. Split plots for individual text-characteristic variable relationships with low and high text-complexity 
levels. Top clusters (red/gray) are high text-complexity texts. Bottom clusters (blue/black) are low text- 
complexity texts. See the online article for the color version of this figure. 


Several were created for selected combinations of text charac- 
teristics. A general finding was that there was interplay among 
the text characteristics in relation to text complexity. A limita- 
tion of contour plots is that a maximum of two predictors can be 
plotted. Figure 8 illustrates the interplay of age of acquisition 
with phrase diversity in relation to text-complexity level. The 
plot was generated from a random forest regression with just the 
two text-characteristic variable operationalizations and text- 
complexity level as the outcome, without controlling for the 
other seven text characteristics and with minimum node size of 
five. The main finding from the illustrative contour plot was 
that age of acquisition was conditioned by phrase diversity in 
relation to text complexity. Regions of texts are seen in the plot. 
The highest values on text complexity (red in the plot) occurred 
in texts that had high values on age of acquisition and high 
values on phrase diversity (low amounts of repetition, redun- 
dancy, or patterning). As well, texts with the lowest text- 
complexity values (dark blue) tended to have low values for age 
of acquisition and phrase diversity. However, some texts (e.g., 
light blue in the lower right quadrant) that had high values on 
age of acquisition had low text-complexity values when age of 
acquisition was moderated or conditioned by low values on 
phrase diversity, that is, when a fair amount of repetition,
redundancy, or patterning was present. The point is, again, there
is interplay of text characteristics in relation to text-complexity 
level. 
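
The surface behind such a contour plot can be approximated by fitting a two-predictor random forest and predicting over a grid. The following is a hedged sketch on synthetic data: scikit-learn, the number of trees, the grid range, and the reading of "minimum node size of five" as `min_samples_leaf=5` are all assumptions rather than details reported in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic z-scored predictors for 350 hypothetical texts.
rng = np.random.default_rng(1)
aoa = rng.normal(size=350)          # age of acquisition
phrase_div = rng.normal(size=350)   # phrase diversity

# Toy outcome with an interaction between the two predictors.
complexity = 50 + 8 * aoa + 8 * phrase_div + 4 * aoa * phrase_div
complexity = complexity + rng.normal(scale=2.0, size=350)

X = np.column_stack([aoa, phrase_div])
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=1)
forest.fit(X, complexity)

# Predict over a regular grid; each cell is the forest's estimated text complexity.
grid = np.linspace(-2.0, 2.0, 25)
gx, gy = np.meshgrid(grid, grid)  # gx: age of acquisition, gy: phrase diversity
pred = forest.predict(np.column_stack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

print(pred[0, 0], pred[-1, -1])  # low-low corner vs. high-high corner
```

The array `pred` could then be passed to any contour-plotting routine, with regions of similar predicted complexity appearing as bands of the same color.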

Text characteristic profile changes as text-complexity level 
increased. Another visualization method to understand text- 
characteristic collective patterning was to examine text charac- 
teristic profiles as text-complexity level increased (Cohen et al., 
2003). The nine text characteristics were standardized as 
z-scores, texts were formed into quintile groups, and a graph 
was plotted using the within group means. As shown in Figure 
9, first, the lowest quintile texts had a profile pattern that is 
markedly different from the other patterns. On average, the 
texts were characterized by less complex word structure (low 
decoding demand and relatively few syllables), relatively low- 
level vocabulary (younger age of acquisition, not very abstract 
words, and words that were not as rare as what appeared in 
more complex texts), coupled with, on the whole, highly re- 
dundant and repetitive texts (the exception is noncompressibil- 
ity; recall that lower scores on the discourse level variables 
meant more redundancy and patterning). Moving up the graph, 
the next two quintile patterns were highly similar to one an- 
other, and the highest two quintile profiles were nearly flat with 
minor exceptions. In essence, text-characteristic profiles gradually
changed as text complexity increased. Second, word structure
became increasingly complex with each rising quintile. As
well, on the whole, word meanings became harder and harder as 
text complexity increased. The exception was word rareness, 
which was similar in the bottom two quintiles. Also, on the 
whole, discourse-level redundancy and repetition decreased as 
text complexity increased (recall that higher discourse level 
averages reflected less redundancy and repetition). Noncom- 
pressibility was a minor exception in that although texts were 
consistently less compressible as text complexity increased, the 
changes were less dramatic than for other discourse-level vari- 
ables or for word structure and word meaning characteristics. In 
short, on the whole, as text complexity increased, word struc- 
ture and word meanings became harder, and texts displayed less 
and less redundancy, repetition, and patterning. Again, the 
interplay among the text characteristics was an important factor 
for text-complexity level. 
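
The quintile profile computation (standardize each characteristic, group texts into quintiles of text complexity, average within group) can be illustrated as follows. The function names, the rank-based quintile assignment, and the toy values are hypothetical and are offered only as one plausible reading of the procedure.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values to mean 0 and (population) SD 1."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


def quintile_profiles(complexity, characteristics):
    """Return five dicts (lowest to highest quintile) of within-group mean z-scores.

    complexity: text-complexity score per text.
    characteristics: dict mapping a characteristic name to per-text raw values.
    Assumption: quintiles are assigned by rank, so groups are as equal in size as possible.
    """
    n = len(complexity)
    order = sorted(range(n), key=lambda i: complexity[i])
    group = [0] * n
    for rank, i in enumerate(order):
        group[i] = min(rank * 5 // n, 4)
    z = {name: zscores(vals) for name, vals in characteristics.items()}
    profiles = []
    for q in range(5):
        members = [i for i in range(n) if group[i] == q]
        profiles.append({name: mean(z[name][i] for i in members) for name in z})
    return profiles


# Toy data for 10 hypothetical texts.
complexity = [5, 12, 20, 27, 35, 41, 50, 58, 66, 75]
characteristics = {
    "syllables": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "phrase_diversity": [2, 1, 3, 4, 4, 5, 7, 6, 8, 9],
}
profiles = quintile_profiles(complexity, characteristics)
for q, profile in enumerate(profiles, start=1):
    print(q, {k: round(v, 2) for k, v in profile.items()})
```

Plotting each quintile's dictionary as one line across the characteristics would give a profile graph of the kind shown in Figure 9.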

Genre effects. Genre effects were analyzed using the same 
procedures as noted in the preceding section on “Degree of 
text-characteristic variability in high versus low text- 
complexity levels” (Cohen et al., 2003; Green & Salkind, 
2011). Four groups of texts were created—narrative and infor- 
mational texts that were high text complexity and narrative and 
informational texts that were low text complexity, and the
text-characteristic profile differences across genre, controlling 
for text-complexity level, were examined. Only texts identified 
as narrative or informational were included in the analysis 
because hybrid or other texts were rare. Text complexity means, 
standard deviations, and ranges were comparable for the narrative 
and informational high text-complexity texts, and they were 
comparable for the two genres within low text-complexity texts: 
for high text-complexity narratives (n = 64), 67.16, 4.85, 59.86 
to 78.19; for high text-complexity informational (n = 24), 
67.39, 4.81, 60.34 to 77.02; for low text-complexity narratives 
(n = [illegible]), 33.12, [illegible], 22.14 to 40.50; and for low text-
complexity informational (n = 17), 32.88, 5.07, 24.11 to 40.43. 
Finally, the nine text characteristics were standardized as 
z-scores, and using the text-characteristic within-group means, 
the graph in Figure 10 was created to show the four text groups’ 
text-characteristic profiles. 

In general, as would be expected, controlling for text-complexity 
level, the genres within text-complexity level had slightly dif- 
ferent text-characteristic profiles. For high text-complexity nar- 
rative texts, on average, abstractness, intersentential complex- 
ity, phrase diversity, and text density tended to have higher 
levels than the other text characteristics. On the other hand, for 
high text-complexity informational texts, only age of acquisition,
on average, tended to rise above the other text-characteristic
levels, and, on average, noncompressibility
tended to dip below all other text characteristic levels. Notably,
several text characteristics were at approximately the same
levels in the two genres. The most divergent characteristics
across high-text-complexity text genres were age of acquisition
(higher for informational texts) and word rareness (also higher
for informational texts).

Figure 8. Contour plot of age of acquisition, phrase diversity, and text complexity.

For low text-complexity narrative texts, on average, text- 
characteristic levels were approximately similar, with the exception
of noncompressibility, which was, surprisingly, much higher
than the others. For low text-complexity informational texts, on
average, decoding demand, syllables, and word rareness tended to be
higher than the other informational text characteristics. Notably,
several text characteristic levels were similar across the two low- 
text-complexity genres. The most divergent were decoding de- 
mand, syllables, word rareness—all higher measures for informa- 
tional texts—and noncompressibility, which was higher for 
narratives. Again, another example of text-characteristic interplay 
was witnessed. When word structure and word meanings were 
relatively difficult (as for informational texts compared to narra- 
tives), more repetition and patterning at the discourse level (real- 
ized by relatively low scores) likely modulated the impact of the 
difficult words to bring the overall text complexity to a relatively 
low level. 

Figure 9. Text-characteristic profiles by text-complexity quintile group. 

Figure 10. Text-characteristic profiles according to text-complexity level and genre. The top two lines 
represent high text-complexity levels. The bottom two lines represent low text-complexity levels. Solid lines 
represent narrative texts, and dotted lines represent informational text. 


Conclusions and Discussion 


Conclusions 


Nine text characteristics were most important for early- 
grades text complexity: word structure— decoding demand and 
number of syllables in words; word meaning—age of acquisi- 
tion, abstractness, and word rareness; and sentence and dis- 
course level—intersentential complexity (the linear edit dis- 
tance operationalization), phrase diversity (the longest common 
string operationalization), text density/information load, and 
noncompressibility. The nine-characteristic model predicted 
text complexity very well, in fact, nearly as well as the more 
complicated model with all 238 text-characteristic operational- 
izations. Notably, the three most important text characteristics 
were at the sentence and discourse level—intersentential com- 
plexity, text density/information load, and phrase diversity. 
Additionally, interplay among text characteristics was impor- 
tant to explanation of text complexity. While a clear thread of 
the relationship of the nine text characteristics with text com- 
plexity was evident, the relationship was not globally linear. 
Instead, text-characteristic relationships interplayed differen- 
tially in local neighborhoods of similar texts. 


Discussion 


To our knowledge, the present study is the first to reveal 
important text characteristics for early-grades text complexity 
through empirical investigation. The results support the contention 
that early-grades texts can be considered complex systems con- 
sisting of characteristics at multiple linguistic levels that variously 
interplay to impact text complexity. Further, the nine most
important text characteristics revealed in the present study map to
some of the well-researched critical features of young children’s
early reading development. The early-grades developmental phase
is often characterized as “cracking the code,” which has led some 
educators to believe the work of early reading is primarily about, 
or even all about, phonological awareness and word-related fac- 
tors. Interestingly, phonemic measures did not surface among the 
most important text characteristics for text complexity. The im- 
portance of phonological awareness for progress in early reading is 
indisputable. Possibly the measures in the current study did not 
sufficiently reflect the domain of key phonological knowledge 
required of students. 

As for the centrality of word structures in “cracking the code,” 
it was not surprising to find that word decoding and number of
syllables were among the most important characteristics for predicting
text complexity. As well, factors involved in word meanings, specifically
age of acquisition of words, abstractness, and word rareness,
were important. The findings are consistent with prior suggestions 
that lower text complexity might be achieved in part through 
inclusion of easier and more familiar vocabulary (e.g., Hiebert & 
Fisher, 2007). 

At the same time, aspects of the findings in the present study 
shed additional light on the distinctiveness of early-grades text 
complexity compared to upper-grades text complexity. While tra- 
ditional measures of within-sentence syntax (such as sentence 
length or various grammatical indices) were not among the nine 
most important text characteristics, some of the discourse-level 
metrics captured within-sentence complexity while also measuring 
text characteristics beyond the sentence level. For instance, while 
the intersentential complexity metric, linear edit distance, ad- 
dressed the degree of word, phrase, and letter repetition across 
adjacent sentences, it was also impacted by overall sentence length 
irrespective of patterning and repetition. That is, linear edit distance
captured both within- and across-sentence characteristics.
Consequently, within-sentence features were necessarily included. 
Still, it is worth noting that traditional within-sentence indicators 
such as sentence-level syntax or sentence length itself were not 
among the critical metrics for early-grades text complexity. One 
possible reason is that although within-sentence indicators tend to
be highly associated with complexity for texts beyond second
grade, the long sentences in many early-grades texts tend to be
marked by repetition of words or phrases.
The repetition of words or phrases in early-grades texts
may reduce the challenge posed by long sentences and render 
within-sentence indicators, such as length, less effective for esti- 
mating early-grades text complexity. 

One of the most striking findings was the emergence of 
discourse-level text characteristics that primarily captured repeti- 
tion, redundancy, and patterning in texts. The finding was striking 
because it is often not discussed in the context of “code cracking.” 
Educators and researchers tend to focus on word-level text char- 
acteristics as almost singularly critical for early reading, and the 
role of how texts are structured to facilitate ease of early-reading 
progress is often overlooked. Indeed, even one of the most commonly
used text-leveling systems, the Fountas and Pinnell (1996,
2012) system, does not directly include attention to repetition and
redundancy, though it does address text structure and genre in
general. As noted earlier, few prior text-analysis systems for the
upper grades include analysis of discourse-level characteristics— 
although those systems were not intended for early-grades texts. 
However, at least one or two of the discourse-level characteristics 
(intersentential complexity and phrase diversity) in the present 
study are reminiscent of cohesion operationalizations in the Coh- 
Metrix (Graesser et al., 2011) system. While some evidence exists 
that above second-grade level, models of text complexity that 
include discourse-level indicators do not outperform those that do 
not include them (Nelson, Perfetti, Liben, & Liben, 2011), our 
findings suggest that attention to discourse-level characteristics at 
the early grades is crucial (cf. Hiebert & Pearson, 2010, who 
suggested that current text-complexity systems may need adjust- 
ments for early-grades texts). Indeed, the functions of repetition 
and redundancy in discourse have received increasing attention on 
the part of linguists in the past few years, and repetition/redun- 
dancy is considered by some to be an essential feature of language 
use (Bazzanella, 2011). 

Unearthing the presence of locally embedded differential inter- 
play of text characteristics and witnessing examples of that inter- 
play are novel contributions to the literature. The finding was 
intriguing in that to the mature eye, early-grades texts appear to be 
“simple.” But experienced readers often have long forgotten the 
challenges of learning to read in the early phases, and to more 
expert readers, as Prince (1997) and others (e.g., Bazzanella, 2011) 
have pointed out, “. . . the really interesting complexities of 
language work so smoothly that they become transparent” (Prince, 
1997, p. 117). 

The finding of locally embedded text-characteristic interplay 
was also supportive of prior linguists’ and complexity theorists’ 
understandings that in complex environments, subsystems (in the 
present study, sublinguistic systems) often “cooperate” to balance 
efficiency and effectiveness. In the case of early-grades texts, 
subsystems “cooperate” to balance young children’s ease of learn- 
ing to read with the requirements for depth of processing (Bar- 
Yam, 1997; Juola, 2003; Merlini Barbaresi, 2003). However, while 
the presence of regional interactions among text characteristics
could be witnessed, for example, in the single decision tree and
the contour plot, explaining or describing them with simple gen- 
eralizations was difficult because of the number of characteristics 
involved and the variation in coexisting characteristics across 
witnessed incidents of interactions. 

Although local interplay was a chief characteristic of early- 
grades text complexity, some general trends described features of 
the early-grades texts in the aggregate. One general trend was that, 
on the whole, as text-complexity level increased, word structure 
and word meaning text characteristics became more complicated 
or harder (as would be expected), while texts displayed less and 
less redundancy, repetition, and patterning. That is, linguistic 
levels interplayed such that text characteristics tended to coalesce 
in one way for less complex texts and in another way for more 
complex texts. 

Another general trend was for high-complexity informational 
texts to have somewhat higher age-of-acquisition and word rare- 
ness measures compared to narrative texts. On the other hand, 
low-complexity informational texts tended to have somewhat 
higher decoding demand, more syllables, and rarer words than 
narratives, but narratives were less compressible. For both high- 
and low-complexity texts, interestingly, discourse-level text char- 
acteristics were fairly similar across the two genres with informa- 
tional texts having slightly lower discourse-level values, indicating 
more repetition, redundancy, or patterning. The result again sup- 
ports the interplay of variables in that the presence of more 
difficult words was compensated by increased scaffolding in the 
form of repetition or patterning. The difference should be considered
with caution, as a relatively small number of books constituted
the genre analysis. Rather than assuming the result is generalizable,
it is more appropriate to consider it sufficiently thought-provoking
to warrant further analysis in future studies.

However, taken at face value, the genre result is consistent with 
logical expectations. In general, at the early-grades levels, infor- 
mational texts might tend to have more difficult vocabulary than 
narratives, and at the lowest text-complexity levels, it would be 
challenging to lower decoding demand for content-laden material. 
It is worth noting that when using random forest regression with 
the nine-characteristic text-complexity model, random forest re- 
gression easily accounts for any localized or general text charac- 
teristic collections that might be related to genre. 

The promise of random forest regression and machine- 
learning research methods. The successful use of random forest
regression for modeling text complexity in early-grades texts
demonstrates the potential for the random forest regression advan- 
tage when addressing a high-dimensional educational problem. In 
the case of early-grades text complexity, a modeling technique 
such as linear regression may not satisfactorily allow for investi- 
gations employing either the large number of variables required for 
text analysis or the potentially huge number of complex text-
characteristic interactions that likely permeate early-grades texts. It
is important to note, however, that we did not conduct a
comparison of results from a theorized linear regression model and
a random forest model, and consequently our statement here about 
the possible random forest regression advantage is hypothetical. At 
the same time, it is difficult to imagine how such a comparison
could be tested, because there is no way to tap a priori localized
interactions among text characteristics in traditional linear regression.

As well, random forest can be a more robust model than some
other traditional modeling techniques in that it accounts for ex- 
ceptional cases. To comprehensively study early-grades texts, 
where many different types of text exist, it is important to include 
even those texts that might traditionally be considered “outliers,” 
that is, texts that might have text-characteristic configurations that 
fall in the long tails of early-grades text distributions. For instance, 
label books do not contain connected text, but instead one word is 
shown beside a picture. In a traditional analysis, such books might 
be considered outliers because they have text characteristics that 
are quite different from a majority of texts. However, label books 
are commonly used in early-grades classrooms, and any study of 
text complexity should take them into consideration. As well, 
random forest regression automatically handles conditionality that 
can occur in ensembles of text characteristics, and as such it brings 
the tails of distributions “into the fold.” 

Finally, random forest regression can take advantage of a weak 
predictor by using it only when it is needed. In the present study, 
noncompressibility might be considered a weak predictor in that it 
was not highly correlated with other characteristics (except for 
phrase diversity) or with text complexity. However, noncompress- 
ibility tended to locate repetition, redundancy, and patterning 
where the other three discourse-level characteristics did not locate 
it. Such texts were rare in the present study, but on those rare 
occasions, there was important value in the noncompressibility 
measure. 

High-dimensional problems are common in educational arenas 
in cases where large numbers of variables are at play and large 
amounts of data are generated, and random forest regression is a 
statistical modeling technique that could expand the repertoire of 
educational statistical modeling. Where pressing educational prob- 
lems involve large numbers of variables and/or potentially large 
numbers of interactions among variables, random forest regression 
could provide uniquely satisfying solutions (Baca-Garcia et al., 
2007). 

The machine-learning techniques used in the present study 
uniquely revealed early-grades text complexity. While prior text- 
complexity systems existed, theorization about text complexity, 
especially early-grades text complexity, was limited (Mesmer et 
al., 2012), and debates about construct coverage in the existing 
measurement systems proliferated (e.g., Sheehan et al., 2010). As 
a consequence, employing a wide array of possible operational- 
izations of text characteristics, each of which might capture a 
nuanced sense of any text characteristic, was important, as was the 
use of a logical investigative progression to narrow the most 
important characteristics. That is, through machine learning tech- 
niques, the data could “speak,” and a text-complexity model could 
be constructed from the data themselves (Wasserman, 2014). 

Further, the interactive, dynamic graphics used to explore data 
structure are common in machine-learning communities, but not as 
common in educational research. While no statistical significance 
was attached to the visualization techniques, they tended to be very 
useful in understanding functional relationships among text char- 
acteristics and text complexity. 

Limitations of the study. The following limitations of the 
study should be considered as context for interpreting the findings. 
First, although random forest provided many advantages for the 
study of early-grades text complexity, the resulting functional 
shape of the data was interpretable only to a certain degree. That
is, the complexity of text-characteristic interactions was acknowledged,
but it could not be described in simple ways or with a
parsimonious set of rules. Whether lack of a final specified state- 
ment detailing local interactions is a failure or a limitation is 
debatable. For those who embrace complexity theory, tensions 
between chaos and parsimony, between complexity and simplicity 
are natural—they exist in the natural world, and attempts to over- 
specify distort reality. 

Second, text selection for study was extremely important. The 
population of classroom texts should be broadly represented. 
While every attempt was made to accomplish broad representation, 
the texts selected for the study may set boundaries on the gener- 
alizability of findings, and readers of the study should draw their 
own conclusions about the text representation. 

A third limitation is that a traditionalist statistician working in 
the fields of psychology or education might consider the process of 
trimming variables awkward or imprecise. Lacking statistical es- 
timation of variable “significance,” logical analysis was necessary. 
Some may question the reliability of the logical analysis. Cer- 
tainly, when such methodology is used, it is critical that detailed 
description is provided so that readers may glean whether conclu- 
sions are warranted. 

A fourth possible limitation is that because pictures could not be 
analyzed digitally, the role of pictures in early-grades text com- 
plexity was not directly assessed. However, pictures were indi- 
rectly involved in that they were present in both the teacher and 
student substudies for creation of the text-complexity metric. 

Implications for practice. One major practical implication of 
the present results is that educators should consider discourse-level 
text characteristics in early-grade readers perhaps more than is the 
current case. Some researchers and teacher educators advocate that 
educators should account for text “organization” (e.g., Shanahan, 
Fisher, & Frey, 2012), or in the case of Coh-Metrix, discourse- 
level features such as cohesion (Graesser et al., 2011), when 
assigning texts to students. Given that “code-cracking” is prevalent 
during the early grades, it is likely that in everyday classroom 
instruction, word-level characteristics are favored, and discourse- 
level text characteristics may be given short shrift. Instead, atten- 
tion to discourse-level features such as repetition, redundancy, and 
patterning would appear to be in order. 

As well, few teacher educators or researchers espouse the sig- 
nificance of the interplay among text characteristics for text com- 
plexity in general, even above the early grades. While the impor- 
tant text characteristics often, if not typically, make unique 
contributions to text complexity, in many texts, their interplay is 
equally important, if not more important. Consequently, it is crit- 
ical that, when selecting texts for young children, educators con- 
sider ways in which characteristics can modulate one another’s 
challenges. For example, presence of repetition, redundancy, and 
patterning can ease reading progress for children when texts have 
somewhat challenging word structures and/or word meanings. In 
light of evidence that present-day core-reading programs tend to 
have somewhat difficult vocabulary (Foorman et al., 2004), teach- 
ers might particularly observe degrees of repetition and patterning 
in core readers and provide additional instructional support for 
students as needed. 

The finding of more variability in lower text-complexity texts 
than in higher ones was interesting in that some might anticipate
the opposite: less variability in (more control over) the
characteristics for students who are just beginning to learn to read,
with more variability (less control over) characteristics as students 
advance their reading ability. Educators might need to consider the 
lowest level texts especially carefully when choosing texts for 
students’ independent reading versus for instructional settings 
where teachers can provide more support. 

Finally, publishers of early-grades texts should account for 
multiple text characteristics when creating and/or leveling early- 
grades texts. Some current-day leveling systems that are com- 
monly used by publishers and/or classroom teachers, such as 
Fountas and Pinnell’s (2012) system, do take into account text 
characteristics at multiple linguistic levels, but many publishers 
rely solely on measurement of word frequency and sentence 
length. While the latter two factors can be useful for many reasons, 
creation of optimal texts that ease young students’ reading growth 
and use of optimal leveling systems likely requires consideration 
of a wider gamut of early-grades text characteristics. 

Implications for future research. The present findings lend 
credence to a complexity theory of early-grades texts. One chal- 
lenge for future research is further exploration of potential classes 
of early-grades texts where, within class, selected ensembles of 
characteristics condition one another in similar ways. If such 
classes of texts are identifiable, through professional development 
sessions, educators might come to a fuller understanding of the 
importance of selecting texts with certain characteristics to en- 
hance particular cognitions as students begin to learn to read. 

The results of the present work suggest that a tool, an automated 
analyzer, could be created from the final nine-variable predictor 
model using random forest regression. The development of such a 
tool could be potentially useful to researchers who are interested in 
evaluating existing reading materials or to guide the development 
of new materials. 

Finally, the present text-complexity model of text characteristics 
might also be used in intervention efforts. Texts could be theoret- 
ically configured as “best texts to facilitate young children’s read- 
ing progress.” Then in a controlled comparison-group intervention 
design, children’s reading progress could be examined when read- 
ing instruction occurs with such texts compared to other classes of 
texts that exist widely in current-day classrooms. 

Research shows that multiple external representations can significantly enhance students’ learning. Most
of this research has focused on learning with text and 1 additional graphical representation. However, real 
instructional materials often employ multiple graphical representations (MGRs) in addition to text. An 
important open question is whether the use of MGRs leads to better learning than a single graphical 
representation (SGR) when the MGRs are presented separately, 1-by-1 across consecutive problems, 
accompanied by text and numbers. A further question is whether providing support for students to relate 
the different representations to the key concepts that they depict can enhance their benefit from MGRs. 
We investigated these questions in 2 classroom experiments that involved problem solving practice with 
an intelligent tutoring system for fractions. Based on 112 sixth-grade students, Experiment 1 investigated
whether MGRs lead to better learning outcomes than 1 commonly used SGR, and whether this effect can 
be enhanced by prompting students to self-explain key concepts depicted by the graphical representa- 
tions. Based on 152 fourth- and fifth-grade students, Experiment 2 investigated whether the advantage of 
MGRs depends on the specific representation chosen for the SGR condition because prior research
suggests that some SGRs might promote learning more than others. Both experiments demonstrate that 
MGRs lead to better conceptual learning than an SGR, provided that students are supported in relating 
graphical representations to key concepts. We extend research on multiple external representations by 
demonstrating that MGRs (presented in addition to text and 1-by-1 across consecutive problems) can 
enhance learning. 


Keywords: multiple representations, self-explanation prompts, intelligent tutoring systems, fractions, 

In the educational psychology literature, there is substantial
evidence that multiple external representations can enhance stu- 
dents’ learning, compared to a single external representation (Ain- 
sworth, Bibby, & Wood, 2002; Schnotz & Bannert, 2003). By and 
large, the educational psychology literature on learning with mul- 
tiple external representations has focused on learning with text and 
one additional graphical representation (e.g., Ainsworth & Loizou, 
2003; Bodemer, Plötzner, Bruchmüller, & Häcker, 2005; Eitel,
Scheiter, & Schüler, 2013; Schnotz & Bannert, 2003). The advantage
of learning with text and a graphical representation (compared
to learning with text alone) has been attributed to the fact that
multiple external representations stimulate deeper processing by 
requiring learners to integrate information across different repre- 
sentations. However, instructional materials found in real educa- 
tional settings typically contain multiple graphical representations 
in addition to text (van Someren, Boshuizen, & de Jong, 1998), for 
instance in math (e.g., Arcavi, 2003), chemistry (e.g., Kozma & 
Russell, 2005), biology (e.g., Cook, Wiebe, & Carter, 2008), 
physics (e.g., van der Meij & de Jong, 2006), engineering (e.g., 
Nathan, Walkington, Srisurichan, & Alibali, 2011), and program- 
ming (e.g., Kordaki, 2010). Multiple graphical representations are 
typically used because each individual graphical representation 
emphasizes a subset of the domain-relevant information and the 
different graphical representations therefore provide complemen- 
tary information. For this reason, students are often provided with 
multiple graphical representations in addition to text and numbers, 
to provide the information necessary to form a coherent mental 
model of the domain. 

We are not aware of experimental studies that investigate 
whether multiple graphical representations (MGRs) lead to better 
learning than a single graphical representation (SGR), when each 
is provided in addition to text and numbers, even though educa- 
tional psychology research is often conducted in the context of 
materials that contain MGRs. For example, Nathan et al. (2011) 
observed students’ ability to coordinate between MGRs across
STEM courses via gesture. Kozma and Russell (2005) observed 
chemistry experts’ use of MGRs as they work in chemistry labo- 
ratories. Other related research has focused on learning with ani- 
mations. For example, Scheiter, Gerjets, Huk, Imhof, and Kam- 
merer (2009) contrasted students’ learning from a single type of 
animation (either realistic or schematic) to multiple types of
animations (both realistic and schematic). This study did not show 
an advantage of learning with multiple types of representations, 
possibly because one type of animation had overwhelmingly pos- 
itive effects on all target concepts. In other words, the different 
types of animations failed to produce complementary benefits on 
students’ learning, so that multiple types of animations were not 
more effective than a single type of animation. Taken together, 
these prior studies do not systematically investigate whether in- 
struction that uses MGRs is more effective than instruction that 
uses only an SGR. In other words, in spite of the widespread use 
of MGRs, it remains an open question whether prior research on 
learning with text and graphic generalizes to learning with multiple 
graphical representations, when each is presented in addition to 
text and numbers. 

There are many possible ways of presenting MGRs to students: 
We can present them concurrently (i.e., presenting more than one 
graphical representation at the same time) or consecutively (i.e., 
presenting one graphical representation at a time, but different 
graphical representations across a sequence of problems). In this 
article, we focus on the question whether MGRs lead to better 
learning than an SGR when presented consecutively across a 
sequence of problems. Presenting MGRs consecutively is a logical 
next step when extending research on learning with multiple 
external representations to learning with multiple graphical repre- 
sentations: It allows us to investigate whether increasing the num- 
ber of graphical representations leads to better learning without 
increasing the number of connections students have to make be- 
tween representations (because in each problem, students can only 
connect one graphical representation to text and numbers but 
cannot connect different graphical representations to one another). 
By contrast, concurrent presentation of MGRs may place very high 
demands on students’ cognitive load due to many possible con- 
nections between the different graphical representations, text, and 
numbers. Furthermore, consecutive presentation of MGRs is com- 
mon practice in many educational materials. The central hypoth- 
esis of this article is that MGRs, when presented consecutively, 
will enhance students’ learning because students can form a more 
accurate mental model of the domain knowledge by gradually 
refining it across a sequence of problems. This refinement process 
is facilitated by MGRs because they emphasize complementary 
conceptual aspects of the domain knowledge. 


Theoretical Perspectives on Learning With Multiple 
External Representations 


Current theoretical frameworks for learning with multiple ex- 
ternal representations do not address the question of whether 
MGRs lead to better learning than an SGR. According to the 
cognitive theory of multimedia learning (CTML; Mayer, 2003, 
2005), the advantage of learning with text and graphic can be 
attributed to the more efficient use of working memory capacity 
and to deeper conceptual integration of the learning material. The
CTML assumes that verbal and pictorial information are processed
in different information channels. Even though text is often pre- 
sented visually (i.e., in written form), it is encoded into a verbal 
model within working memory, whereas pictorial information is 
encoded visually. Because the capacity of each part of working 
memory (a visual channel and a verbal channel) is limited but 
additive (Sweller, van Merriënboer, & Paas, 1998), learning with
both text and graphic makes better use of the learner’s working 
memory capacity than learning with text alone. Furthermore, ac- 
tive integration of the verbal model and the pictorial model into 
one coherent mental model requires deep conceptual processing of 
the content, which enhances learning, compared to text alone. 

The CTML does not specifically address the use of MGRs 
presented consecutively, accompanied by text. Learning materials 
with an SGR or with MGRs (when each is presented individually 
in addition to text and numbers) both use the same number of 
information channels. The CTML might predict that MGRs lead to 
cognitive overload in the pictorial part of working memory if they 
were presented concurrently, which is known to hamper learning 
(Sweller et al., 1998). On the other hand, based on the CTML, one 
might predict that MGRs will be more effective than an SGR when 
they are presented consecutively by enhancing active integration 
and deeper conceptual processing of the learning content, because 
students may gradually refine their mental model based on the 
different conceptual aspects that each graphical representation 
emphasizes. Specifically, each time students encounter a different
graphical representation, they may incorporate the conceptual per- 
spective emphasized in this graphical representation into their 
mental model of the learning content. 

A second relevant theory, the integrated model of text and 
picture comprehension (ITPC; Schnotz, 2005; Schnotz & Bannert, 
2003), also does not explicitly address the use of MGRs. Under 
this theory, the advantage of learning with multiple external rep- 
resentations stems from the fact that text and graphic lead to 
different internal representations. During mental model formation, 
students integrate these internal representations via structure map- 
ping (Gentner, 1983). The resulting deep integration across repre- 
sentations accounts for the effectiveness of text and graphic over 
text alone. Based on the ITPC, one might predict that MGRs 
enhance learning because they provide complementary informa- 
tion, which allows students to form more accurate mental models. 
However, the ITPC also states that understanding each representation
creates cognitive costs (Schnotz, 2005). Another possible
interpretation of the ITPC is therefore that MGRs do not lead to 
better learning than an SGR as the costs of understanding each 
graphical representation may not outweigh the benefit of integrat- 
ing complementary information into a mental model. 

Taken together, neither the CTML nor the ITPC makes specific
predictions as to whether MGRs lead to better learning
than an SGR, even though both theoretical frameworks are 
consistent with the idea that MGRs might enhance learning by 
allowing students to form more sophisticated mental models of 
domain-relevant concepts. Therefore, the goal of the present 
article is to close the gap between prior educational psychology 
research that has mainly focused on learning with text and 
graphic and the common practice to include MGRs, in addition 
to text-based representations. 


Multiple Graphical Representations of Fractions


We address this question in the important but challenging do- 
main of fractions learning, one of the many domains in which 
MGRs (e.g., circles, rectangles, and number lines) are used exten- 
sively (National Mathematics Advisory Panel [NMAP], 2008). In 
spite of the extensive use of MGRs in fractions instruction, the 
advantage of MGRs of fractions over an SGR has not yet been 
systematically investigated. Commonly used U.S. middle-school 
mathematics curricula (e.g., Bennett, 2004; Fitzgerald, Lappan, & 
Fey, 2004; Hake, 2004) employ a wide variety of graphical rep- 
resentations of fractions, such as area models (e.g., circles, rect- 
angles), linear models (e.g., number lines, or liquid container 
models), and discrete models (e.g., sets of objects). These graph- 
ical representations emphasize different conceptual interpretations 
of fractions, such as fractions as parts of a whole in area models, 
fractions as measurements in linear models that show fractions as 
a segment of a length that can be measured, and fractions as ratios 
in discrete models that depict fractions as a subset of objects with 
a certain property out of a larger set of objects (Charalambous & 
Pitta-Pantazi, 2007). The use of multiple graphical representations 
of fractions in instructional materials has been shown to be effec- 
tive in observational studies (e.g., Moss & Case, 1999) and in case 
studies (e.g., Kafai, Franke, Ching, & Shih, 1998). However, these 
studies did not systematically compare learning with an SGR to 
learning with MGRs. Thus, in addition to being of theoretical 
relevance, the choice of fractions as a domain for our study 
enhances the practical relevance of this research. 

We investigated these questions within the context of the Frac- 
tions Tutor (Rau, Aleven, Rummel, & Rohrbach, 2013). The 
Fractions Tutor is a type of Cognitive Tutor (Koedinger & Corbett, 
2006), namely, an example-tracing tutor (Aleven, McLaren, 
Sewall, & Koedinger, 2009). Like all Cognitive Tutors, it supports 
tutored problem solving, providing step-by-step guidance as stu- 
dents solve complex problems. The use of a Cognitive Tutor as 
a research platform is attractive because Cognitive Tutors support
learners in a common instructional scenario, namely, problem- 
solving practice, they have a proven track record in improving 
students’ mathematics achievement (Koedinger & Aleven, 2007; 
Pane, Griffin, McCaffrey, & Karam, 2013), and they are being 
used in a large number of classrooms across the United States, 
about 600,000 students yearly (Koedinger & Corbett, 2006). Fur- 
thermore, Cognitive Tutors can support interactive graphical rep- 
resentations of fractions and provide targeted feedback on stu- 
dents’ use of these interactive representations as they solve 
fractions problems. 


Scaffolding Learning With Multiple Representations 
Through Self-Explanation Prompts 


Providing learners with multiple representations does not auto- 
matically result in better learning (Ainsworth et al., 2002). As 
discussed, in order to benefit from multiple representations, learn- 
ers must conceptually understand how the different representations 
depict key concepts and integrate these concepts into a coherent 
mental model of the domain (Ainsworth, 2006; de Jong et al.,
1998). However, students tend not to spontaneously engage in
such sense-making activities (Ainsworth, 2006; Yerushalmy,
1991), which can hamper their learning of domain knowledge
(Ainsworth et al., 2002; de Jong et al., 1998; Gobert et al., 2011;
Gutwill, Frederiksen, & White, 1999; van der Meij & de Jong, 
2006). 

Several studies lead to the hypothesis that integration processes 
involved in learning with multiple representations can happen 
through self-explanation activities. Self-explanation activities are 
explanations to oneself that elaborate information provided in the 
learning materials, make connections to prior knowledge, and 
refine mental models (Chi, Bassok, Lewis, Reimann, & Glaser, 
1989; Wylie & Chi, in press). Research shows that students who 
generate a larger number of high-quality self-explanations show 
the largest benefits from multiple external representations (Ain- 
sworth & Loizou, 2003). However, students tend not to spontane- 
ously engage in high-quality self-explanation activities (Ainsworth 
& Loizou, 2003). This observation leads to the hypothesis that 
prompting students to self-explain how different representations 
relate to the conceptual aspects of the domain may further enhance 
their benefit from multiple representations. Berthold, Eysink, and 
Renkl (2009) found that prompting students to self-explain while 
studying multirepresentational learning materials had a positive 
effect on both conceptual and procedural knowledge. Zhang and 
Linn (2011) found positive effects of enhancing student-generated 
explanations while learning with dynamic chemistry representa- 
tions. In a recent review of the effects of self-explanation prompts 
on learning in multimedia environments, Wylie and Chi (in press) 
argued that in environments in which students have to relate multiple
representations, self-explanation prompts can be an effective in- 
structional strategy. However, we are not aware of a systematic 
investigation of the effect of self-explanation prompts on students’ 
benefit from multiple representations compared to a single repre- 
sentation. In line with prior research, we expect that prompting 
students to self-explain will increase the likelihood that they will 
engage in deeper sense-making activities with MGRs, thus increas- 
ing their benefit from MGRs. 

In our research, we use a focused form of self-explanation 
support; namely, menu-based prompts. In complex multimedia 
environments, such focused forms of supporting self-explanation 
have been hypothesized to be more effective than traditional 
open-ended approaches (Wylie & Chi, in press). Supporting
self-explanation by means of menu-based selections is a type of
support chosen in many empirical studies with Cognitive Tutors 
(see Aleven & Koedinger, 2002; Atkinson, Renkl, & Merrill, 
2003) and was more effective than open-ended self-explanation 
prompts in several studies (Gadgil, Nokes-Malach, & Chi, 2012; 
Johnson & Mayer, 2010; van der Meij & de Jong, 2011). 


Overview of Experiments 


We conducted two classroom experiments to investigate 
whether MGRs presented consecutively lead to better learning 
than one SGR. In both conditions, each graphical representation 
was accompanied by text and numbers. Further, Experiment 1 
investigates whether self-explanation prompts enhance this hy- 
pothesized advantage of MGRs. 


Experiment 1 


Classroom Experiment 1 investigated, in a 2 X 2 design, the 
effects of learning with MGRs compared to learning with an SGR 
and the effects of being prompted to self-explain compared to not
being prompted. In all conditions, students worked with only one 
graphical representation per tutor problem (presented in addition to 
text and numbers). That is, students in the SGR conditions en- 
countered only one graphical representation across all tutor prob- 
lems, namely, the number line. By contrast, students in the MGR 
conditions encountered different graphical representations across 
consecutive tutor problems. We chose the number line represen- 
tation for the SGR conditions because number lines are considered 
a privileged representation that relates to many math concepts, 
such as integers and decimals, and that provides a foundation for 
algebra (Siegler et al., 2010). 
Specifically, we investigated the following hypotheses: 


Hypothesis 1.1: Students who learn with MGRs will outper- 
form students who learn with an SGR on all measures of 
robust learning; namely, reproduction and transfer of concep- 
tual knowledge, and reproduction and transfer of procedural 
knowledge (main effect of number of graphical 
representations). 


Hypothesis 1.2: Students who receive self-explanation 
prompts will outperform students who do not receive such 
prompts on all measures of robust learning (main effect of 
self-explanation prompts). 


Hypothesis 1.3: Students who learn with MGRs will outper- 
form students who learn with an SGR in particular when they 
receive self-explanation prompts on all measures of robust 
learning (interaction between number of graphical represen- 
tations and self-explanation prompts). 


Method 


Participants. One hundred thirty-two sixth-grade students 
from a U.S. middle school participated in the study during regular 
mathematics instruction. The school district was among the 15% 
lowest ranked of 500 Pennsylvania public school districts in the 
school year of 2007/2008. In the school year of 2007/2008, about 
half of all students in the school district were enrolled in free or 
reduced-price lunch programs, roughly two thirds of all students 
were White, and around one third were African American.¹ Students
were aged 10 to 13 years.

Fractions tutors. The tutor used in the study included five 
different graphical representations of fractions, shown in Figure 1. 
All students worked through two topics of the Fractions Tutor: 
equivalent fractions and fraction addition (see Appendix for a 
description of the topics covered). Figure 2 shows an example of
a fraction addition problem with and without self-explanation
prompts. As is typical of Cognitive Tutors (e.g., Koedinger &
Corbett, 2006), the Fractions Tutor provides problem-solving ac- 
tivities while giving error feedback on all steps. Error feedback 
was designed to encourage students to reconsider their answer by 
reminding them of a previously introduced principle, or by pro- 
viding them with an explanation of their error. At any time, 
students could request hints that provided guidance on how to 
solve the next step. 

Test instruments. We assessed students’ knowledge of frac- 
tions three times: prior to the tutoring sessions with a short prior 
knowledge test and twice after the tutoring sessions with equivalent
immediate and delayed posttests. The delayed posttest was
administered 1 week after the immediate posttest. Two equivalent 
posttest forms were created such that the test items were structur- 
ally the same, but with different numbers. The order of test forms 
was counterbalanced. Test items were adapted from standardized 
state tests and from the fractions literature. We used four measure- 
ment scales to assess students’ robust knowledge, validated by a 
confirmatory factor analysis. The four scales differed in whether 
they tested reproduction or transfer of fractions knowledge, and 
whether they tested conceptual knowledge or procedural knowledge.
The conceptual reproduction scale of the test included equivalent
fractions problems with the same representations used in the
tutor, and items that required students to draw the graphical rep- 
resentations they had seen in the tutor. Procedural reproduction 
items included equivalent fractions and fraction addition problems 
that were purely symbolic. Conceptual transfer items included 
equivalent fractions problems and identifying fractions problems 
using unfamiliar graphical representations and cover stories not 
covered in the tutor. The procedural transfer scale included frac- 
tion addition problems with unfamiliar graphical representations 
and fraction subtraction problems (subtraction was not covered by 
the tutor). Example test items are provided in the Appendix. As the
purpose of the prior knowledge test was to control for differences 
in students’ prior knowledge rather than to assess students’ learn- 
ing gains, it was a shorter version of the posttests and included 
only reproduction items. The prior knowledge test had 13 items;
the posttests had 18 items. For questions that required multiple
steps, partial credit was given for each correct step. The scores 
reported here are relative scores (i.e., ranging from 0 to 1). 
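
The scoring rule just described can be illustrated with a minimal Python sketch. It assumes that every step carries equal weight and that a relative score is the number of correct steps divided by the number of scored steps; the article does not report the exact weighting, and the function name `relative_score` is hypothetical.

```python
def relative_score(step_results):
    """Return the share of correct steps across all test items (0 to 1).

    step_results holds one list per test item; each entry is True when the
    student completed that step correctly. Equal weighting of steps is an
    assumption made for this illustration only.
    """
    earned = sum(sum(1 for ok in steps if ok) for steps in step_results)
    possible = sum(len(steps) for steps in step_results)
    if possible == 0:
        raise ValueError("at least one scored step is required")
    return earned / possible


# Three items: a two-step item, a three-step item, and a single-step item.
example = [[True, True], [True, False, False], [True]]
print(relative_score(example))
```

Under this assumption, a student who gets part of a multistep item right still earns credit for the steps completed, which is the point of partial-credit scoring.
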
Experimental design. Students were randomly assigned to 
one of four conditions, which varied on two experimental factors: 
number of graphical representations (SGR vs. MGRs) and self- 
explanation prompts (SE vs. noSE). Students in the SE conditions 
were prompted by the tutor to self-explain what aspects of the 
given graphical representations correspond to the concepts of 
the numerator and the denominator of the fraction (e.g., “How does the
number line show the numerator of the fraction?”), or how the 
procedure they performed symbolically corresponds to the manip- 
ulation of the graphical representations (e.g., “How did you con- 
vert the fraction in the circle?”). Students selected their answer 
from a drop-down menu (see Figure 2). Students in the noSE 
conditions received the same tutor problems without the prompts. 
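
As an illustration of random assignment in a 2 X 2 design, the following Python sketch shuffles a list of student identifiers and deals them into the four conditions in turn. This is one possible procedure and not necessarily the one used in the study; the function name `assign_conditions`, the seed, and the round-robin dealing (which keeps group sizes nearly equal) are assumptions.

```python
import random
from itertools import product


def assign_conditions(student_ids, seed=0):
    """Randomly assign each student to one cell of the 2 x 2 design."""
    # Cross the two factors: number of representations x prompts.
    conditions = [f"{rep}-{prompt}"
                  for rep, prompt in product(("SGR", "MGR"), ("noSE", "SE"))]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled students into the four conditions in turn.
    return {sid: conditions[i % len(conditions)]
            for i, sid in enumerate(shuffled)}


assignment = assign_conditions(range(132))
print(sorted(set(assignment.values())))
```
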
In the SGR conditions, all problems involved an interactive 
number line representation (see Figure 2). In the MGR conditions, 
students worked with the five graphical representations shown in 
Figure 1, which were presented in an interleaved fashion, so that 
only one graphical representation was presented at a time, but 
consecutive problems used different graphical representations. 
Students first solved a fractions problem using the number line. 
They then performed the same steps symbolically. Next, students 
revisited the same problem they had solved with the number line 
with the four remaining graphical representations. Students in the 
SE conditions were asked to reflect on what aspect in the graphical 
representation corresponds to numerator and denominator with
regard to the steps they previously completed with the number line
representation. Students in the noSE conditions skipped this step.

¹ The precise numbers are withheld to preserve anonymity of the participating school.

Figure 1. Number line, pie chart, rectangle, stack, set (from left to right) as used in the study.

Experimental procedure. On the first day of the study, stu- 
dents completed the prior knowledge test, which took about 20 
min. On the next day, students started working with the tutor. They 
worked with the tutor during their regular mathematics instruction 
in their school’s computer lab for a total of 2.5 hr, spread across 
two consecutive days. Students worked at their own pace, but time 
spent with the tutor was held constant across experimental condi- 
tions, so that students completed as many tutor problems as they 
could in the available time. Immediately after finishing the work 
on the tutor, students completed the immediate posttest, which 
took about 30 min. Six days later, students completed the delayed 
posttest. 


Results 


We excluded students from the analysis if they had been absent 
for at least two study days, if they were statistical outliers on test 
performance (i.e., if they performed more than two standard de- 
viations better or worse than their classmates on both the imme- 
diate and the delayed posttest), or if they were statistical outliers 
with respect to the time they spent on the Fractions Tutor (i.e., if 
the time they spent on the tutor was over two standard deviations 
more or less than average due to absenteeism or unsupervised 
work with the tutor outside of class). Data from N = 112 students 
were included in the data analysis (n = 29 in the SGR-noSE
condition, n = 29 in the SGR-SE condition, n = 28 in the
MGR-noSE condition, and n = 26 in the MGR-SE condition). The
number of excluded students did not differ between conditions,
χ²(3, N = 21) < 1. There were no significant differences between
conditions on the prior knowledge test (F < 1). However, since 
scores on the prior knowledge test correlated significantly with 
overall performance in the immediate and delayed posttest (ps < 
.01), we include the prior knowledge test as a covariate in subse- 
quent analyses. 
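
The test-performance exclusion rule can be sketched in Python as follows. The sketch assumes that the sample standard deviation is used and that each student is compared with the mean of the group of scores supplied; the function names are hypothetical, and the article does not report these computational details.

```python
from statistics import mean, stdev


def outside_two_sd(values):
    """Flag each value that lies more than two SDs from the group mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > 2 * s for v in values]


def performance_outliers(immediate, delayed):
    """Flag students who are outliers on BOTH posttests.

    immediate and delayed are score lists in the same student order.
    """
    flags_immediate = outside_two_sd(immediate)
    flags_delayed = outside_two_sd(delayed)
    return [a and b for a, b in zip(flags_immediate, flags_delayed)]


# Invented scores for illustration: the last student is far below the
# group on both posttests and is therefore flagged.
immediate = [0.5] * 9 + [0.0]
delayed = [0.5] * 9 + [0.0]
print(performance_outliers(immediate, delayed))
```

Requiring an extreme score on both posttests, as the rule in the text does, avoids excluding a student on the basis of one unusual test session.
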

We follow Cohen (1988) and consider an effect size partial η²
of .01 to be a small effect, .06 a medium effect, and .14 a large 
effect. Similarly, we consider an effect size d of .20 to be a small 
effect, .50 a medium effect, and .80 a large effect. All p-values for 
post hoc comparisons were adjusted using the Bonferroni correc- 
tion. 
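
These conventions can be made concrete with a short Python sketch. It relies on the standard identity that partial η² equals F × df_effect / (F × df_effect + df_error) and on the usual Bonferroni adjustment, which multiplies each p value by the number of comparisons and caps the result at 1. The function names and the label "negligible" for values below the small threshold are assumptions introduced for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def label_partial_eta_squared(value):
    """Label partial eta squared using the Cohen (1988) benchmarks."""
    if value >= 0.14:
        return "large"
    if value >= 0.06:
        return "medium"
    if value >= 0.01:
        return "small"
    return "negligible"  # label assumed here; not used in the article


def label_cohens_d(d):
    """Label Cohen's d (sign ignored) using the Cohen (1988) benchmarks."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"  # label assumed here; not used in the article


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# F(1, 108) = 9.13 is the self-explanation main effect reported in Results.
eta = partial_eta_squared(9.13, 1, 108)
print(round(eta, 2), label_partial_eta_squared(eta))
```

Applied to the main effect of self-explanation prompts on conceptual reproduction, F(1, 108) = 9.13, the identity reproduces the reported partial η² of .08, which falls in the medium range.
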

We conducted repeated-measures analyses of covariance 
(ANCOVAs) with students’ scores on the prior knowledge test as 
a covariate, immediate and delayed posttest scores as dependent 
variables and number of representations and self-explanation
prompts as independent variables. In addition, we computed a 
priori contrasts on the effect of number of representations within 
the SE conditions and within the noSE conditions to clarify the 
predicted interaction effect. To clarify the results from the 
ANCOVAs, we used post hoc comparisons. Adjusted means and 
standard deviations can be found in the Appendix. Table 1 gives an
overview of the main effects and interaction effects. Table 2 shows
the results from a priori contrasts and post hoc comparisons.

Figure 2. Fraction addition with the number line representation, without self-explanation prompts (left) versus with self-explanation prompts (right).

To investigate Hypothesis 1.1 (that students who learn with 
MGRs will outperform students who learn with an SGR), we 
computed the main effect for number of graphical representations. 
We found no main effect for the number of graphical representations 
on any knowledge type (Fs < 1). To examine Hypothesis 1.2 
(that students benefit from self-explanation prompts), we com- 
puted the main effect of self-explanation prompts. We found a 
significant main effect in favor of the SE conditions on conceptual 
reproduction, F(1, 108) = 9.13, p < .01, partial η² = .08, but not 
with respect to other knowledge types. Finally, we investigated 
Hypothesis 1.3 (that self-explanation prompts enhance students’ 
benefit from MGRs) by computing the interaction effect between 
number of graphical representations and self-explanation prompts. 
As expected, we found significant interaction effects between the 
number of graphical representations and self-explanation prompts 
on conceptual reproduction, F(1, 108) = 13.02, p < .01, and 
procedural transfer, F(1, 108) = 11.35, p < .01. To better under- 
stand the interaction effect, we computed a priori contrasts. Within 
the SE conditions, we found a significant advantage of the 
MGR-SE condition over the SGR-SE condition on procedural 
transfer at the immediate posttest, t(108) = 2.01, p < .05, d = 
0.73, and marginally significant effects on conceptual reproduction 
at the immediate posttest, t(108) = 1.58, p < .10, d = 0.44, and 
the delayed posttest, t(108) = 1.53, p < .10, d = 0.44, and on 
conceptual transfer at the delayed posttest, t(108) = 1.64, p < .10, 
d = 0.46. Within the noSE conditions, there was a significant 
advantage of the SGR-noSE condition over the MGR-noSE condition 
on procedural transfer at the immediate posttest, t(108) = 
2.80, p < .01, d = 0.99, but not on any other knowledge type. 
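
As a reading aid, partial η² can be recovered from a reported F statistic and its degrees of freedom with the standard identity F·df_effect / (F·df_effect + df_error). The sketch below assumes this identity applies to the effects reported here; the function name is hypothetical, and the code is not taken from the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    explained = f_value * df_effect
    return explained / (explained + df_error)


# Main effect of self-explanation prompts on conceptual reproduction,
# F(1, 108) = 9.13, as reported in the text.
print(round(partial_eta_squared(9.13, 1, 108), 2))  # prints 0.08
```

The result agrees with the reported partial η² of .08 for this effect.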

To find out why there was no overall advantage of self- 
explanation prompts for knowledge types other than conceptual 
reproduction, we used post hoc comparisons. Within the MGR 
conditions, we found a significant advantage for self-explanation 
prompts on conceptual reproduction at the immediate and delayed 
posttests (ps < .01), conceptual transfer at the delayed posttest 
(p < .05), and procedural transfer at the immediate posttest (p < 
.01), and a marginal advantage of self-explanation prompts on 
procedural reproduction at the delayed posttest (p < .10). Within 
the SGR conditions, there were no significant effects of self- 
explanation prompts. 


Finally, we used post hoc comparisons to contrast the two most 
successful conditions: SGR-noSE and MGR-SE. We found marginal 
differences on conceptual reproduction and conceptual transfer 
at the delayed posttest (ps < .10) in favor of the MGR-SE 
condition. 


Table 1 

Overview of Main Effects and Interaction Effects From Experiment 1 

                                                      Conceptual knowledge                Procedural knowledge
Effect                                                Reproduction            Transfer    Reproduction    Transfer
Main effect of multiple vs. single representations    ns                      ns          ns              ns
Main effect of self-explanation prompts               SE > noSE, p < .01      ns          ns              ns
Interaction between these two factors                 see Table 2, p < .01    ns          ns              see Table 2, p < .01

Note. SE = self-explanation. 


Table 2 

Overview of A Priori Contrasts and Post Hoc Comparisons From Experiment 1 

              Conceptual knowledge                                        Procedural knowledge
Posttest time Reproduction                  Transfer                      Reproduction                  Transfer

Effect of number of representations with self-explanation prompts
Immediate     MGR-SE > SGR-SE, p < .10      ns                            ns                            MGR-SE > SGR-SE, p < .05
Delayed       MGR-SE > SGR-SE, p < .10      MGR-SE > SGR-SE, p < .10      ns                            ns

Effect of number of representations without self-explanation prompts
Immediate     ns                            ns                            ns                            SGR-noSE > MGR-noSE, p < .01
Delayed       ns                            ns                            ns                            ns

Effect of self-explanation prompts when working with an MGR
Immediate     MGR-SE > MGR-noSE, p < .01    ns                            ns                            MGR-SE > MGR-noSE, p < .01
Delayed       MGR-SE > MGR-noSE, p < .01    MGR-SE > MGR-noSE, p < .05    MGR-SE > MGR-noSE, p < .10    ns

Effect of self-explanation prompts when working with an SGR
Immediate     ns                            ns                            ns                            ns
Delayed       ns                            ns                            ns                            ns

MGR-SE vs. SGR-noSE
Immediate     ns                            ns                            ns                            ns
Delayed       MGR-SE > SGR-noSE, p < .10    MGR-SE > SGR-noSE, p < .10    ns                            ns

Note. MGR = multiple graphical representation; SE = self-explanation; SGR = single graphical representation. 


Discussion 


Our results do not support Hypothesis 1.1, that students who 
learn with MGRs will outperform students who learn with an SGR 
regardless of whether self-explanation prompts are provided. They 
provide partial support for Hypothesis 1.2, that students who 
receive self-explanation prompts outperform students who do not 
receive such prompts. The results support Hypothesis 1.3, that 
there is an interaction between the number of graphical representations 
and self-explanation prompts, such that students benefit 
from MGRs in particular when they are prompted to self-explain 
the relation between the graphical representations and key concepts 
of fractions (see Tables 1 and 2). 

Let us first consider the effect of number of representations. Our 
results do not support Hypothesis 1.1: There was no main effect of 
number of representations (SGR vs. MGRs). Our experiment does 
not confirm that MGRs overall lead to better learning than SGRs. 
However, in line with Hypothesis 1.3, the results suggest that 
MGRs enhance learning compared to an SGR when they are 
accompanied by self-explanation prompts. First, the MGR-SE 
condition outperformed the SGR-SE condition on conceptual reproduction 
and conceptual transfer, although the difference was 
only marginally statistically significant. Second, the MGR-SE 
condition outperformed the SGR-noSE condition on conceptual 
reproduction and conceptual transfer. Thus, the effect of MGRs 
combined with self-explanation prompts over the SGR conditions 
was specifically found on conceptual knowledge. The evidence is 
somewhat tentative: Although the interaction effect was statistically 
significant, some of the post hoc comparisons interpreting 
this effect were only marginally statistically significant. 

Why might MGRs not lead to an overall difference on conceptual 
knowledge, compared to an SGR (Hypothesis 1.1)? In line 
with the literature on learning with multiple external representations 
(Ainsworth & Loizou, 2003), it may be that learning with 
MGRs, especially when they emphasize somewhat disparate conceptual 
viewpoints, will not be effective unless crucial sense-making 
processes are supported. Self-explanation prompts appear 
to be one effective means to do so. This interpretation is in line 
with the CTML (Mayer, 2005) and the ITPC (Schnotz, 2005), 
which state that additional representations may create cognitive 
costs, and that the advantage of multiple representations depends 
on students’ ability to integrate them into a coherent mental model. 
Our results indicate that self-explanation prompts are a successful 
means to help students overcome potential costs of MGRs. 

Contrary to our hypotheses, we did not find a lasting advantage 
of MGRs over SGRs on most measures of procedural knowledge. 
We found significant differences in favor of MGRs on procedural 
transfer, but this advantage was temporary (i.e., it occurred 
on the immediate posttest but not on the delayed posttest). 
The finding that MGRs promote conceptual learning but not (or to 
a lesser extent) procedural learning is consistent with a view of 
MGRs as providing complementary conceptual viewpoints. Ap- 
parently, MGRs play a lesser role in supporting students’ ability to 
apply a known procedure to solve a familiar task type. MGRs may 
not help students perform a procedure per se but may instead help 
them acquire the flexibility to apply a procedure to multiple 
situations, as suggested by the advantage of MGRs (with self-explanation 
prompts) on procedural transfer. 

Taken together, the results from Experiment 1 extend previous 
research on learning with multiple representations. Most prior 
research has focused on learning with multiple external represen- 
tations (e.g., text and graphic). Extending this prior research, we 
demonstrate benefits of using MGRs compared to an SGR, when 
MGRs are presented one-by-one across problems. Although the 
evidence is not uniformly strong, overall, a fair summary of the 
evidence is that students benefit from MGRs, compared to an 
SGR, provided they are prompted to self-explain how the repre- 
sentations relate to key concepts in the domain. The benefit of 
learning with MGRs over learning with an SGR was particularly 
pronounced for conceptual knowledge and persisted until at least 1 
week after the intervention. 

Several open questions arise from Experiment 1. First, one may 
ask whether the advantage of MGRs over an SGR was due to the 
choice of graphical representation for the SGR group, namely, the 
number line. Although the number line is considered a powerful 
representation for fractions (Siegler et al., 2010), it is also the 
representation students struggle with most (NMAP, 2008). Area 
models (i.e., circles and rectangles) are considered to be more 
intuitively accessible (Cramer, Wyberg, & Leavitt, 2008). It is 
therefore possible that students will benefit equally from a version 
of the Fractions Tutor that contains only a circle or only a rectangle 
representation as from the MGR version. Second, one might ask 
whether having students in the MGR conditions revisit the same 
numerical problems is pedagogically realistic. Might MGRs lead 
to even better learning if they were presented across different 
numerical problems? 


Experiment 2 


We conducted a second classroom experiment to investigate 
whether the advantage of learning with MGRs over learning with 
an SGR (when provided with self-explanation prompts) is due to 
the specific SGR used in Experiment 1, as well as to test whether 
the advantage of MGRs over an SGR could be replicated with an 
updated version of the Fractions Tutor that includes a more com- 
prehensive curriculum and in which all numerical problems are 
different (i.e., no repeats as in Experiment 1). We included three 
SGR-SE conditions that use only a number line, only a circle, or 
only a rectangle. We included self-explanation prompts in all 
conditions, given the conclusion from Experiment 1 that self- 
explanation prompts enhance students’ learning from MGRs. 

Specifically, we investigate the following hypothesis: Working 
with the MGR-SE version of the tutor leads to higher learning 
gains than working with the SGR-SE version on all measures of 
robust learning. 


Method 


Participants. Two hundred and fifty-nine fourth- and fifth- 
grade students from six different schools in three school districts 
(31 classes) participated in the study. The schools’ rankings in the 
school year of 2009/2010 were in the top 10% of 2,468 Pennsylvania 
public schools.² In the school year of 2009/2010, 10%-30% 
of the students in the participating school districts were enrolled in 
free or reduced-price lunch programs, over 90% were White, and less 
than 5% were African American. 

Fractions tutor. We revised the Fractions Tutor in line with 
our goal to have a more comprehensive tutor curriculum (Rau et 
al., 2013; see the Appendix). An important change in the Fractions 
Tutor concerns the choice of graphical representations. We decided 
to include the number line representation, the circle, and the 
rectangle representation (see Figure 3) but to exclude the set 
representation. 

We excluded the set representation from the Fractions Tutor 
because the new version covered several topics (e.g., improper 
fractions, fraction addition) in which the use of sets is not advis- 
able from an educational standpoint, and our experimental design 
required combining each representation with all topics. 

As in Experiment 1, the Fractions Tutor provided problem- 
solving support in the form of error feedback and hints. In a 
spiraling curriculum, it covered a sequence of topics three times 
(see Appendix). As mentioned, self-explanation prompts were 
included in each tutor problem to help students reflect on the 
conceptual aspects demonstrated by the graphical representation. 

Test instruments. We assessed students’ knowledge of frac- 
tions three times: immediately before and after using the tutor, and 
a week later. We created three equivalent test forms (i.e., forms 
with structurally identical test items that use different numbers) 
and counterbalanced the order in which they were administered. 
We made changes to the test used in Experiment 1 in accordance 
with the changes made to the Fractions Tutor (i.e., addition of tutor 
topics and choice of graphical representations). The test consisted 
of 18 items, each of which was worth one point. For questions that 
required multiple steps, partial credit was given for each correct 
step. The scores reported here are relative scores (i.e., ranging 
from 0 to 1). The theoretical structure of the test resulted from a 
factor analysis performed on the pretest data. Four knowledge 
types were identified through this factor analysis: reproduction 
with area models (i.e., problems that involved circles and rectan- 
gles), reproduction with the number line, conceptual transfer and 
procedural transfer. Both reproduction scales included identifying 
fractions given a graphical representation, making a graphical 
representation given a symbolic fraction, and recreating the unit 
given a graphical representation of fractions. Conceptual transfer 
items included proportional reasoning questions with and without 
graphical representations. Procedural transfer items included comparison 
questions with and without graphical representations. Example 
test items for each scale are provided in the Appendix. 
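
The scoring rule described above (18 one-point items, partial credit per correct step, scores rescaled to the range from 0 to 1) can be sketched as follows. The sketch assumes that all steps within an item carry equal weight, which the text does not state; the function names and the example data are hypothetical.

```python
def item_score(correct_steps, total_steps):
    """Score one item out of 1 point; partial credit is the share of correct steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return correct_steps / total_steps


def relative_score(items):
    """Average the item scores so that the result lies between 0 and 1."""
    return sum(item_score(correct, total) for correct, total in items) / len(items)


# Hypothetical student: 12 single-step items correct, 4 two-step items
# half correct, and 2 single-step items incorrect (18 items in total).
example_items = [(1, 1)] * 12 + [(1, 2)] * 4 + [(0, 1)] * 2
example_score = relative_score(example_items)  # 14 points out of 18
```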

Experimental design. Students were randomly assigned to 
either the MGR-SE condition or the SGR-SE condition. Within the 
SGR-SE condition, we randomly assigned students to either a 
number-line-only, rectangle-only, or circle-only version of the 
Fractions Tutor. Students in the MGR-SE condition worked with 
all three graphical representations in a one-per-problem fashion. 
Within the MGR-SE condition, we randomly assigned students to 
one of six possible orders of graphical representations (i.e., number 
line–rectangle–circle, number line–circle–rectangle, rectangle–number 
line–circle, rectangle–circle–number line, circle–number 
line–rectangle, or circle–rectangle–number line) to counterbalance 
potential order effects. As mentioned, the different graphical rep- 
resentations in the MGR-SE group were presented in an inter- 
leaved fashion. This experiment included a number of additional 
conditions, as reported elsewhere (Rau, Rummel, Aleven, Pacilio, 
& Tunc-Pekkan, 2012), in which we presented MGRs in different 
sequences. For the purpose of investigating the advantage of 
MGRs over SGRs, we chose the condition with an interleaved 
sequence because it corresponds most closely to the procedure in 
Experiment 1. 
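
The six orders used for counterbalancing in the MGR-SE condition are the permutations of the three graphical representations. The sketch below enumerates them and assigns students to orders; the balanced round-robin assignment after shuffling is an assumption for illustration (the text states only that assignment was random), and the function names and seed are hypothetical.

```python
import itertools
import random

REPRESENTATIONS = ("number line", "rectangle", "circle")


def all_orders():
    """Return every possible order of the three graphical representations."""
    return list(itertools.permutations(REPRESENTATIONS))


def assign_orders(student_ids, seed=0):
    """Shuffle the students, then cycle through the orders (assumed scheme)."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    orders = all_orders()
    return {sid: orders[i % len(orders)] for i, sid in enumerate(shuffled)}
```

With three representations there are 3! = 6 orders, matching the six orders listed above.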

Experimental procedure. On the first study day, students 
completed a pretest, which took about 30 min. On the next day, 
students started working with the Fractions Tutor. As in Experi- 
ment 1, students worked on the Fractions Tutor in the computer lab 
at their schools for a total of 5 hr during their regular mathematics 
instruction for five to six consecutive school days (depending on 
the school’s class periods). Students worked with the tutor at their 
own pace, but the time students spent with the tutor was held 
constant across classrooms and across experimental conditions, 
such that the number of problems each student completed was 
allowed to differ between students. On the day following the 
tutoring sessions, students took the immediate posttest, which took 
about 30 min to complete. Seven days after the posttest, students 
completed an equivalent delayed posttest. 


Results 


We excluded students who did not encounter all topics covered 
by the Fractions Tutor. As the Fractions Tutor looped through the 
sequence of topics three times, we had to exclude students who 
completed fewer than 33% of all tutor problems to ensure that all 
students had encountered each topic of the Fractions Tutor at least 
once. We further excluded students who missed at least one test 
day. Due to relatively high rates of absenteeism during regular 
class time, this resulted in a total of N = 152 with n = 71 students 
in the MGR-SE condition and n = 81 students in the SGR-SE 
condition (of these, n = 26 in the number-line-only condition, n = 
25 in the rectangle-only condition, and n = 30 in the circle-only 
condition). There was no significant difference between the SGR 
conditions with respect to the number of students excluded (χ² < 
1), or between the SGR and MGR conditions, χ²(1, N = 259) = 
1.579, p > .10. There were no significant differences between 
students who were included or excluded on any dependent measure 
at the pretest (ps > .10). There were also no significant 
differences between conditions at pretest for any dependent measure 
(ps > .10). 

² The precise numbers are withheld to preserve anonymity of the participating schools. 

Figure 3. Interactive representations used in fractions tutor: circle, rectangle, and number line. 

We used two different sets of analysis of variance (ANOVA) 
models to test our hypothesis. To analyze differences in students’ 
learning gains from working with the Fractions Tutor, we com- 
puted ANOVAs with time of measurement (pretest, immediate 
posttest, and delayed posttest) as within-subject factor, number of 
representations (SGR vs. MGRs) as between-subjects factor, and 
number of representations by test time as an interaction factor. 
Within this ANOVA model, we conducted pairwise comparisons 
for the learning gains from the pretest to the immediate posttest 
and from the pretest to the delayed posttest, separately for each 
condition. Table 3 provides an overview of the main effects and a 
priori comparisons for the ANOVA model. To analyze the differ- 
ences between conditions, we computed ANCOVAs with number 
of representations as between-subjects factor, posttest time (im- 
mediate and delayed posttest) as within-subject factor, pretest 
scores as covariates and the immediate and the delayed posttest as 
repeated, dependent measures. Within these ANCOVA models, we 
used a priori contrasts to compare the MGR-SE and SGR-SE 
conditions at the immediate posttest and at the delayed posttest 
separately. For each model, dependent measures were students’ 
scores on the pretest, the immediate posttest, and the delayed 
posttest on reproduction with number lines, reproduction with area 
models, conceptual transfer, and procedural transfer, respectively. 
All reported p-values were adjusted using the Bonferroni correc- 
tion. Table 4 shows the results from the ANCOVA model. We 
provide the estimated means and standard deviations by condition 
and test time in the Appendix. 

Learning effects. We first explored overall learning gains 
using the ANOVA model, computing the main effect of test time 
on students’ scores at the pretest, the immediate posttest, and the 
delayed posttest. The effect of test time was significant on reproduction 
with area models, F(2, 445) = 3.59, p < .05, partial η² = 
.01, reproduction with the number line, F(2, 445) = 9.25, p < .01, 
partial η² = .04, and conceptual transfer, F(2, 445) = 4.55, p < 
.01, partial η² = .02, such that students’ scores were higher on the 
posttests than on the pretest. 

Differences between conditions. As a first step, we tested 
whether the different SGR-SE conditions (i.e., the number-line- 
only condition, the rectangle-only condition, and the circle-only 
condition) could be treated as one homogeneous group. Because 
there were no significant differences between the different 
SGR-SE conditions at the posttests on any dependent measure 
(ps > .10), we treat them as one collapsed SGR-SE condition in 
the following analyses. 

To investigate the hypothesis that working with the MGR-SE 
version of the tutor leads to higher learning gains than working 
with the SGR-SE version on all measures of robust learning, we 
used the ANCOVA model to compute the main effect of number 
of graphical representations on students’ test scores at the imme- 
diate and delayed posttests with pretest score as a covariate. The 
main effect of number of graphical representations was significant 
on reproduction with the number line, F(1, 445) = 9.02, p < .01, 
partial η² = .02, conceptual transfer, F(1, 445) = 7.01, p < .01, 
partial η² = .02, and marginally significant on procedural transfer, 
F(1, 445) = 3.29, p < .10, partial η² = .01. We did not find an 
interaction between number of graphical representations and post- 
test time (i.e., immediate and delayed posttest) for any dependent 
measure, ps > .10. A priori contrasts comparing the MGR-SE and 
SGR-SE conditions showed a significant advantage for the 
MGR-SE condition on reproduction with the number line both at 
the immediate posttest, t(445) = 2.09, p < .05, d = 0.09, and at 
the delayed posttest, t(445) = 2.66, p < .01, d = 0.12, as well as 
on conceptual transfer at the delayed posttest, t(445) = 2.27, p < 
.05, d = 0.10. 

As an alternative test for our hypothesis, we used post hoc 
comparisons within the ANOVA model to investigate learning gains 
by condition (see Table 3). Pairwise comparisons showed that 
students in the MGR-SE condition improved significantly from 
pretest to immediate posttest on reproduction with area models, 
t(445) = 2.40, p < .05, d = 0.10, and reproduction with the number 
line, t(445) = 3.16, p < .01, d = 0.14, as well as from pretest to 
delayed posttest on reproduction with the number line, t(445) = 
3.80, p < .01, d = 0.17, and conceptual transfer, t(445) = 2.63, 
p < .05, d = 0.12, and marginally significantly on reproduction 
with area models, t(445) = 2.05, p < .10, d = 0.09. Students in the 
SGR-SE condition showed marginal improvement from pretest to 
delayed posttest on conceptual transfer, t(330) = 2.05, p < .10, 
d = 0.06. 


Discussion 


Our results show that students significantly improved from 
pretest to posttest only when they worked with the MGR-SE 
version of the Fractions Tutor, but not if they worked with the 
SGR-SE versions. We found that, while students in the MGR-SE 
condition show significant and lasting learning gains on most 
dependent measures (Table 3, row “Post hoc comparisons of 
learning gains for MGR-SE condition”), the collapsed SGR-SE 
condition shows marginal learning gains only on conceptual trans- 
fer at the delayed posttest, but not on any other dependent measure 
(Table 3, row “Post hoc comparisons of learning gains for SGR-SE 
condition”). We did not find improvement on procedural transfer 
in either condition. Procedural transfer was assessed with compar- 
ison tasks that required students to convert given fractions to a 
common denominator, or to find benchmarks to compare the 
fractions to. Although the Fractions Tutor does not provide practice 
with these procedures, we expected that students would acquire 
knowledge about the relative size of fractions that they could 
use to solve the procedural transfer tasks. However, the results are 
not in line with this expectation. The procedural transfer tasks may 
have demanded too much of the students in our sample. In other 
words, we believe that the lack of learning gains on procedural 
transfer is due to a misalignment of this test scale and the Fractions 
Tutor. On the remaining test scales, the MGR-SE condition shows 
significant learning gains. 

In line with our hypothesis that students working with multiple 
graphical representations would learn more, we found an advan- 
tage of the MGR-SE condition over the SGR-SE conditions on 
several dependent measures (see Table 4). Specifically, we found 
advantages of working with MGRs over working with an SGR on 
reproduction items that included number lines as well as on con- 
ceptual transfer, but not (as we had hypothesized) on reproduction 
with area models or on procedural transfer. This finding suggests 
that MGRs are particularly useful at promoting conceptual knowl- 
edge of fractions. As argued, different graphical representations 
provide different conceptual perspectives on fractions, which 
might encourage students to engage in deeper processing of frac- 
tions concepts and to form a more comprehensive mental model. 
Procedural knowledge, on the other hand, requires students to 
learn how to carry out algorithms. It appears that MGRs do not 
promote the acquisition of transferable procedural knowledge. 

Further, we found that MGRs help learning of number lines, but 
not of area models. It is encouraging that MGRs promote learning 
about the number line because the number line is an important, 
central representational tool in mathematics. It can be used to 
connect fractions to real numbers and decimals, and it is a foun- 
dation for understanding coordinate systems in later algebra 
(NMAP, 2008; Siegler et al., 2010). On the other hand, area 
models may be more intuitive and familiar for students than 
number lines (e.g., because they build on students’ real-world 
knowledge about sharing and division activities, see Mack, 1995). 
Furthermore, area models tend to be introduced earlier in fractions 
instruction than number lines. Perhaps the greater familiarity explains 
why the MGR-SE version of the Fractions Tutor did not help 
students perform better on reproduction with area models, compared 
to the SGR-SE conditions. 


General Discussion 


The goal of the experiments presented in this article was to 
investigate whether the well-established advantage of multiple 
external representations (i.e., text and graphic) generalizes to an 
advantage of multiple graphical representations over a single 
graphical representation when each is presented in addition to text 
and numbers. We focus on situations in which students encounter 
graphical representations one-at-a-time, hypothesizing that stu- 
dents will form a more accurate mental model of the domain 
knowledge by gradually refining it as they encounter different 
graphical representations that emphasize complementary concep- 
tual aspects of fractions. This question is interesting from a prac- 
tical standpoint because MGRs are typically used in realistic 
educational materials. This question is also interesting from a 
theoretical standpoint because existing theoretical frameworks for 
learning with multiple external representations do not make specific 
predictions as to whether MGRs are more effective than an 
SGR. Specifically, the CTML (Mayer, 2003, 2005) suggests that 
MGRs may be effective because they might enhance active inte- 
gration and deeper conceptual processing than an SGR, or because 
they can yield more elaborate mental models of the domain con- 
tent. Similarly, the ITPC (Schnotz & Bannert, 2003; Schnotz, 
2005) suggests that MGRs might be more effective than an SGR 
because they enable students to form a more elaborate mental 
model of the domain. However, the ITPC also cautions that the 
potential advantage of MGRs needs to outweigh the costs associ- 
ated with understanding each graphical representation. Investigat- 
ing whether MGRs lead to better learning than an SGR is a step 
toward closing the gap between, on one hand, educational psy- 
chology research that has mostly focused on learning with multiple 
external representations and, on the other hand, common practice 
of using MGRs in instructional materials. 

Our two experiments provide evidence that MGRs can lead to 
better learning than SGRs, when they are accompanied by self- 
explanation prompts. We attribute our finding to the complemen- 
tary conceptual perspectives that MGRs provide on the learning 
content. As in many STEM domains, instruction on fractions uses 
different graphical representations with the goal of emphasizing 
different conceptual aspects of the domain (Charalambous & Pitta- 
Pantazi, 2007). Only if students integrate these different concep- 
tual views into one coherent mental model can they gain full 
conceptual understanding of fractions. Deep conceptual processing 
of complex learning material may be crucial to students’ benefit 
from MGRs, in line with both the CTML and the ITPC. Our work 
extends these frameworks by showing that learning can be en- 
hanced by integrating multiple graphical representations, presented 
consecutively across different problems. 

Although this integration process is critical to students’ benefit 
from multiple representations, students do not (often) engage in it 
spontaneously (Ainsworth et al., 2002; Yerushalmy, 1991) and 
thus need to be supported in doing so. We implemented instruc- 
tional support for this critical process in the form of self- 
explanation prompts that encourage students to make connections 
between each graphical representation and the key concepts of 
fractions they depict. In Experiment 1, we found that self- 
explanation prompts were in fact necessary for students in our 
experiment to benefit from MGRs: Only when provided with 
self-explanation prompts did we find an advantage of MGRs over 
SGRs. Although several studies have demonstrated that students 
benefit from self-explaining multirepresentational learning mate- 
rials (e.g., Ainsworth & Loizou, 2003; Berthold et al., 2009; Zhang 
& Linn, 2011), Experiment 1 is, to the best of our knowledge, the 
first to systematically investigate the effects of self-explanation 
prompts while contrasting MGRs versus SGRs. It thus extends 
research that has investigated effects of self-explanation prompts 
on learning with multiple representations. 

Given that MGRs were presented consecutively in our experi- 
ments, the self-explanation prompts likely stimulated gradual re- 
finement of students’ mental model of fractions, as they encoun- 
tered additional conceptual aspects across a sequence of fractions 
problems. As mentioned, we chose to present graphical represen- 
tations across consecutive problems because we consider this to be 
the next logical step in extending research on multiple external 
representations to multiple graphical representations and because, in 
our experiments, the number of direct connections between representations 
was held constant across conditions. Furthermore, concurrent 
presentation of MGRs may impose a high cognitive 
load. An interesting open question is whether our 
findings generalize to other possible ways to present MGRs within 
problem sequences, for instance, when MGRs are presented con- 
currently, within the same problem. In light of our findings in 
Experiment 1, we expect that students’ success in learning from 
MGRs will depend on their ability to relate each of them to key 
domain concepts. Based on prior research documenting that stu- 
dents tend not to spontaneously engage in such sense-making 
processes (Ainsworth et al., 2002; Yerushalmy, 1991), we expect 
that even when MGRs are presented concurrently, students may 
need to be supported to engage in these processes. Yet presenting 
MGRs concurrently would allow students to make connections 
directly between conceptually corresponding elements of graphical 
representations. On the one hand, support for connection making 
between MGRs might further enhance students’ benefits from 
MGRs because students can then directly compare the different 
conceptual aspects that each graphical representation emphasizes. 
On the other hand, it may be that this task is cognitively over- 
whelming, such that cognitive overload interferes with students’ 
learning. It would be interesting to investigate whether direct 
support for connection making between graphical representations 
further enhances students’ benefits from MGRs, and how this 
support should be designed such that potential negative effects due 
to high cognitive load can be prevented. 

Our experiments showed different effects of MGRs on different 
types of knowledge. We found that MGRs promote learning of 
conceptual knowledge, and (in Experiment 2) that MGRs help 
students learn the more difficult graphical representation. Since 
different graphical representations provide different conceptual 
perspectives on the abstract concept of fractions, they might en- 
hance deeper processing of crucial concepts within the domain, 
leading to an advantage on conceptual knowledge. However, coun- 
ter to our hypothesis, MGRs did not enhance procedural knowl- 
edge. In retrospect, we can see why MGRs might not be particu- 
larly helpful to students’ learning of procedures performed on 
fractions. It may be that the ability to perform procedures on 
fractions is independent of the graphical representation used, such 
that performing these algorithms on different graphical represen- 
tations does not enhance students’ learning of procedural knowl- 
edge (e.g., fraction addition). There is a hint in the results from 
Experiment 1 to indicate that MGRs may enhance students’ ability 
to transfer procedural knowledge to novel tasks, but overall, the 
evidence that MGRs enhance procedural transfer is rather weak 
and should be explored in future research. Given these consider- 
ations, we expect that our findings generalize to other domains that 
use MGRs to enhance students’ conceptual knowledge by provid- 
ing different conceptual perspectives on the domain, each instan- 
tiated by a particular graphical representation. It is likely that there 
are many such domains, including STEM domains. 

While the effect sizes in Experiment 1 were considerable (rang- 
ing between d = 0.44 and d = 0.99), we found only small effect 
sizes in Experiment 2. We attribute the small effect sizes in 
Experiment 2 to the small learning gains; it may be that a differ- 
ence of d = 0.12 between conditions when the learning gains are 
only d = 0.17 is meaningful. It may be that these effect sizes 
reflect the fact that the students in our studies already had done a 
considerable amount of fractions learning before the study started 
(in fact, students in Experiment 2 came from a higher performing 
student population than those in Experiment 1). It may be, further, 
that these effect sizes reflect the fact that learning with MGRs is 
not without its cost (as pointed out by the theoretical frameworks). 
Not only must students become familiar with the different graph- 
ical representations, the sense-making processes that are required 
to take advantage of these representations to build richer, more 
integrated mental models may impose substantial cognitive load. It 
may be that additional interventions that support students in mak- 
ing direct connections between MGRs in a way that decreases 
cognitive load (e.g., using color coding to direct students’ attention 
to relevant conceptual aspects) would further increase the effects 
of MGRs. Our results may not (yet) justify building up a practice 
of working with MGRs in a domain in which such a practice has 
not yet been established. On the other hand, the practical relevance 
of an intervention depends not only on effect sizes but also on the 
ease with which it is implemented. As discussed, it is common 
practice to use MGRs in many STEM domains. Our results provide 
support for that practice. 
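
As an illustration of how a standardized effect size such as d can be obtained from group means and standard deviations, the following minimal sketch uses a pooled standard deviation. The article does not state which formula was used, so the pooled-variance version, the function name `cohens_d`, and all input numbers are assumptions for illustration only.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using a pooled standard deviation (assumed formula)."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical groups, not data from the experiments
print(round(cohens_d(2.5, 1.0, 20, 2.0, 1.0, 20), 2))  # 0.5

# Size of the between-condition effect relative to the overall learning gain
print(round(0.12 / 0.17, 2))  # 0.71
```

The second line of output makes the argument in the text concrete: a between-condition difference of d = 0.12 amounts to roughly 70% of a learning gain of d = 0.17.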

In sum, the work presented here demonstrates that students’ 
learning can benefit from MGRs in a complex and challenging 
area of mathematics learning within the context of realistic edu- 
cational settings. It extends the literature on learning with multiple 
external representations in several important ways. To the best of 
our knowledge, it is the first rigorous experimental investigation 
that compares learning with MGRs to learning with an SGR, each 
provided in addition to text and numbers. Across two classroom 
experiments, our results consistently demonstrate that students’ 
robust learning of conceptual knowledge of fractions can be en- 
hanced by providing them with MGRs, as long as students are 
prompted to self-explain the relation of each graphical representation
to the key concepts it depicts. 


Table A1 
Estimated Marginal Means (and Standard Deviations) for Experiment 1 

                                             SGR-noSE                  SGR-SE                    MGR-noSE                  MGR-SE
Posttest time       Variable                 M            SD           M            SD           M            SD           M            SD
Immediate posttest  Conceptual reproduction  2.45         0.67         [illegible]  0.96         [illegible]  1.12         [illegible]  0.64
                    Procedural reproduction  2.95         0.62         [illegible]  0.62         2.68         0.84         2.95         0.58
                    Conceptual transfer      1.60         0.64         1.60         0.92         1.65         1.01         [illegible]  1.14
                    Procedural transfer      2.23         1.26         1.65         1.27         1.27         [illegible]  2.36         0.84
Delayed posttest    Conceptual reproduction  1.98         1.10         1.97         1.09         1.42         1.03         2.40         0.70
                    Procedural reproduction  2.55         0.73         [illegible]  [illegible]  2.31         0.88         [illegible]  0.48
                    Conceptual transfer      2.30         0.87         [illegible]  0.99         2.06         1.03         2.66         0.58
                    Procedural transfer      2.39         0.96         2.11         1.02         1.94         1.14         [illegible]  0.77


Note. SGR = single graphical representation, SE = self-explanation, MGR = multiple graphical representation. The maximum score was 3 for all 
knowledge types. 





Table A2 
Estimated Marginal Means (and Standard Deviations) for Experiment 2 

                                                    SGR-SE          MGR-SE
Time                Variable                        M       SD      M       SD
Pretest             Reproduction with number line   0.43    0.03    0.45    0.03
                    Reproduction with area models   0.61    0.03    0.59    0.03
                    Conceptual transfer             0.68    0.03    0.73    0.03
                    Procedural transfer             0.49    0.04    0.58    0.04
Immediate posttest  Reproduction with number line   0.50    0.03    0.60    0.03
                    Reproduction with area models   0.64    0.03    0.70    0.03
                    Transfer conceptual             0.74    0.03    0.79    0.03
                    Transfer procedural             0.46    0.04    0.55    0.04
Delayed posttest    Reproduction with number line   0.51    0.03    0.63    0.03
                    Reproduction with area models   0.65    0.03    0.68    0.03
                    Transfer conceptual             0.74    0.03    0.84    0.03
                    Transfer procedural             0.52    0.04    0.52    0.04





Note. SGR = single graphical representation; SE = self-explanation; MGR = multiple graphical representation. 
The maximum score was 1 for all knowledge types. 





Figure A1. Interactive representations used in fractions tutor: circle, rectangle, and number line. 


[Screenshot of the circle problem: for each of two circles, students partition the circle into equal sections and drag one section into the white circle diagram; drop-down prompts (smaller than / larger than / equal to) then ask them to compare the sections of circles A and B and the two fractions.] 

Figure A2. Making a circle given a symbolic fraction, combined with prompts to compare the two fractions. 
Reflection prompts are implemented with drop-down menus shown at the bottom. 






[Screenshot of the rectangle problem: given the unit, students partition each of two rectangles into equal sections, drag one section into the white rectangle, and make copies of it to show the fraction; drop-down prompts then ask them to compare the sections of rectangles A and B and the two fractions.] 

Figure A3. Making a rectangle given a symbolic fraction, combined with prompts 
to compare the two fractions. Reflection prompts are implemented with drop-down menus shown at the bottom. 


[Screenshot of the number line problem: for number lines A and B, students partition the number line into equal sections, state what the unit was partitioned into, and place a dot that shows the fraction; drop-down prompts then ask them to compare the sections of the two number lines and the two fractions.] 

Figure A4. Showing a fraction on the number line given a symbolic fraction, combined with prompts to 
compare the two fractions. Reflection prompts are implemented with drop-down menus shown at the bottom. 


An Imagination Effect in Learning From Scientific Text 


Claudia Leopold 


University of Muenster 


Richard E. Mayer 


University of California, Santa Barbara 


Asking students to imagine the spatial arrangement of the elements in a scientific text constitutes a 
learning strategy intended to foster deep processing of the instructional material. Two experiments 
investigated the effects of mental imagery prompts on learning from scientific text. Students read a 
computer-based text on the human respiratory system (control group), read while being asked to form an 
image corresponding to each of 9 paragraphs (imagery group), or read while being asked to form an 
image and being shown an onscreen drawing before each paragraph (picture-before-imagery group) or 
after each paragraph (picture-after-imagery group). Imagery prompts facilitated transfer and retention 
performance compared to a control group on an immediate test (Experiment 1: d = 1.30 on transfer, d = 
0.74 on retention) and on a delayed test (Experiment 2: d = 0.86 on transfer, d = 0.98 on retention), but 
the added drawings had no additional effect. The findings support the imagination principle, which states 
that people learn more deeply when prompted to form images depicting the spatial arrangement of what 
they are reading. 


Keywords: imagination, imagery, multimedia learning, learning strategy 


Consider a text that explains how the respiratory system works, 
such as shown in Appendix A. What can be done to help students 
learn more deeply so that they are better able to answer transfer 
questions based on the lesson? One approach is to add graphics, 
such as a graphic for each paragraph that depicts the structure or 
functioning of a portion of the respiratory system as described in 
the paragraph. The rationale for this approach comes from research 
on the multimedia principle, which has shown that students learn 
more deeply from words and graphics than from words alone 
(Butcher, 2014; Mayer, 2009). For example, Mayer (2009) re- 
ported that across more than a dozen experimental comparisons, 
students performed better on a transfer test after reading a scien- 
tific passage accompanied by corresponding graphics (e.g., draw- 
ings or animation) than without graphics, yielding a median effect 
size greater than 1. 

The explanation for the multimedia principle is that students 
given words and graphics are more likely to engage in appropriate 
cognitive processing during learning, including selecting corre- 
sponding information in the text and graphics, organizing this 
information into corresponding cognitive representations, and in- 
tegrating the verbal and pictorial representations with each other 
and with relevant prior knowledge (Mayer, 2009). In his dual 
coding theory, Paivio (1986, 2007) pointed to the positive cognitive 
consequences that occur when learners make referential connections 
between words and images during learning. Similarly, the 
cognitive theory of multimedia learning posits that building con- 
nections between corresponding verbal and pictorial representa- 
tions is a central process in meaningful learning, as indicated by 
superior transfer performance. 

The present study takes the multimedia principle one step fur- 
ther by asking whether students can learn more deeply by imag- 
ining the spatial arrangement of elements described in a scientific 
text about how the respiratory system works. We call this process 
seeing with the mind’s eye because the students engage in multi- 
media learning by imagining internal graphics rather than viewing 
external graphics. The goal of the present study is to determine the 
cognitive consequences of asking students to imagine graphics that 
depict the structure and functioning of the respiratory system being 
described in the text. Overall, we aim to test what can be called the 
imagination principle, which posits that students learn more 
deeply from an explanative scientific text when they are asked to 
form mental images corresponding to the structures and processes 
described in the text. The present study is motivated by the relative 
lack of research on the imagination principle as an aid to under- 
standing explanative text (Dunlosky, Rawson, Marsh, Nathan, & 
Willingham, 2013). 


Literature Review 


This study on the imagination principle is motivated in part by 
recognition that the potentially powerful role of mental imagery in 
human learning, memory, and cognition has been examined across 
a variety of research literatures including spatial cognition, verbal 
learning, memory mnemonics, and educational psychology. 

In spatial cognition, a number of studies have examined basic 
characteristics and functions of mental imagery (Farah, 1984; 
Ganis, Thompson, & Kosslyn, 2004; Johansson, Holsanova, & 
Holmqvist, 2006; Shepard & Cooper, 1982). An important finding 
from these studies is that the cognitive processing of imagined 
representations follows similar mechanisms as the cognitive pro- 
cessing of perceived representations (Borst & Kosslyn, 2012; 
Finke, 1985; Kosslyn, Thompson, & Ganis, 2006). Thus, there is 
a functional equivalence between visual mental imagery and visual 
perception. This functional equivalence refers, for example, to how 
people inspect and rotate mental images (Kosslyn, Ball, & Reiser, 
1978; Shepard & Metzler, 1971) and indicates that the underlying 
representations for these imagery processes are depictive and spa- 
tial in nature. 


In verbal learning, studies conducted by Paivio (1986) and his 
coworkers showed that mental imagery enhances recall performance 
on basic memory tasks such as remembering word lists. An 
important finding is the concreteness effect, in which people 
remember lists of concrete words or sentences better than lists of 
abstract words or sentences (see Paivio, 1965, 1969, for a review; 
Sadoski, Goetz, & Fritz, 1993). Sadoski, Goetz, and Rodriguez 
(2000) and Goolsby and Sadoski (2013) reported similar findings 
for concrete versus abstract texts. Paivio explained these results 
with the idea that concrete words and sentences evoked imagery 
processes that aided their recall. This interpretation is supported by 
the results of Sadoski and Quast (1990), who found close relations 
between students’ imagery ratings of text passages and their long- 
term recall (r = .40). Furthermore, Paivio and his colleagues 
reported that an instruction to imagine lists of concrete nouns 
versus an instruction to pronounce these nouns improved recall 
probability by about 50% (Paivio, 1975; Paivio & Csapo, 1973). 
Both of these findings can be explained by Paivio’s dual coding 
theory, which posits that adding a nonverbal imaginal code to a 
verbal code serves as a supplementary route for facilitating recall 
(see also Sadoski & Paivio, 2013). 

In memory mnemonics (e.g., keyword method, method of loci), 
the idea of dual coding is applied to facilitate recall of vocabulary 
items (Atkinson, 1975; Raugh & Atkinson, 1975), technical ter- 
minology and foreign words (Carney & Levin, 1998; Jones, Levin, 
Levin, & Beitzel, 2000), and facts (Brigham & Brigham, 1998; 
Levin, Morrison, McGivern, Mastropieri, & Scruggs, 1986; Mc- 
Cormick, Levin, & Valkenaar, 1990). These mnemonic techniques 
have in common that they rely on imagery processes in order to 
establish, for example, referential connections between a vocabu- 
lary item (e.g., the German word Fenster = window) and an 
acoustic associative that is similar in sound (e.g., faint). An exam- 
ple is a mental image of a person standing before a window and 
suddenly fainting so that he or she is falling into the window. 
According to Paivio (1986), these kinds of images help students to 
build connections between verbal and nonverbal (imagery) repre- 
sentations. 

In educational psychology, two branches of research on mental 
imagery can be identified, focusing on the role of imagination in 
learning procedures and in learning facts. In the first research 
branch, focusing on memory for procedures, researchers have 
established an imagination effect when students imagine their 
actions as they learn a procedural task. For example, students were 
asked to imagine the steps of a procedure for how to construct 
formulae in a spreadsheet application (Cooper, Tindall-Ford, 
Chandler, & Sweller, 2001), how to apply geometry rules (Ginns, 
Chandler, & Sweller, 2003), how to find a route in a bus timetable 
(Leahy & Sweller, 2005), or how to use a temperature line graph 
(Leahy & Sweller, 2005). Overall, instructions to form mental 
images in these experiments facilitated learning the various procedural 
tasks when the students had sufficient prerequisite schemas 
about the task. The authors explained this effect with the idea 
that imagination requires learners to automatize the procedures 
similar to mental practice in perceptual-motor tasks frequently 
investigated in sports psychology (Driskell, Copper, & Moran, 
1994). As this type of imagery strategy is focused on facilitating 
automation of procedures, it corresponds to an imagery rehearsal 
strategy. 

In the second branch of research, focusing on memory for facts, 
research established an imagination effect when students were 
asked to imagine pictures in their mind corresponding to facts in a 
narrative. For example, in a classic study, Pressley (1976) taught 
elementary school children in a 20-minute training how to form 
mental images and asked them afterward to read a 950-word story 
with the instruction to make up pictures in their head as they read. 
Students in the control group received a control training in which 
they were asked to do whatever they could in order to remember 
the story. The results showed that students in the imagery group 
remembered more facts about the story than the students in the 
control group. Similar results were reported by Gambrell and 
Jawitz (1993) with fourth-grade students, by Kulhavy and Swen- 
son (1975) with sixth-grade children, and by Giesen and Peeck 
(1984) and Rasco, Tennyson, and Boutwell (1975, Experiment 1) 
with college students. 

In contrast, Anderson and Kulhavy (1972) and Rasco et al. 
(1975, Experiment 2) showed no effect of an imagery instruction 
on text recall, but Anderson and Kulhavy found that students who 
actually reported using the imagery strategy performed better in a 
recall test than students who reported not using mental imagery. 
Thus, it seems important to provide clear and specific imagery 
instructions and to check whether the students really follow these 
instructions. 

In general, these results suggest that imagining a picture while 
reading a story is a powerful strategy for fostering recall of facts. 
Furthermore, the results of Rasco et al. (1975) and Gambrell and 
Jawitz (1993) showed that there was no difference between stu- 
dents who were asked to create mental pictures and students who 
were provided with external pictures. Thus, the same underlying 
processes may apply to both internally constructed images and 
externally presented images. 

The present study extends the study of imagination effects 
involving procedural knowledge (by imagining carrying out steps 
in a procedure) and factual knowledge (by imagining mental 
pictures about a story) to an imagination effect involving concep- 
tual knowledge (by forming an image of the spatial structure of a 
scientific system). Investigating the effects of imagination on 
conceptual knowledge is relevant because scientific texts often 
remain challenging for students (Best, Rowe, Ozuru, & McNa- 
mara, 2005; Graesser, 2007). VanLehn and colleagues (2007, 
Experiment 2), for example, found that students who read passages 
from a physics textbook showed no learning gains compared to students who read 
nothing at all but just took the test. Graesser (2007) pointed out 
that scientific or technical text is a challenge because students 
often lack relevant background knowledge and adequate reading 
strategies directed at facilitating deep comprehension. To our 
knowledge, the present study is the first to examine whether 
imagination strategies can apply to the educationally relevant 
domain of learning how a conceptual system works, using 
problem-solving transfer as a dependent measure. 


Theory and Predictions 


According to the cognitive theory of multimedia learning, mean- 
ingful learning (as measured by transfer test performance) occurs 
when learners engage in appropriate cognitive processing during 
learning, including selecting relevant verbal and visual material 
from the lesson, mentally organizing it into verbal and pictorial 
representations, and integrating the representations with each other 
and with relevant prior knowledge activated from long-term mem- 
ory. Prompts to imagine the spatial arrangement of elements in 
scientific text are intended to prime these processes in the same 
way that providing well-designed graphics creates a multimedia 
effect (Mayer, 2009). 

One process that is crucial both in processing text with 
mental imagery and in processing text with corresponding pictures, 
is the integration of words and images, that is, creating referential 
connections between words and corresponding images. In mental 
imagery, the process of creating referential connections is essential 
because mental imagery cannot be applied without the learner 
drawing connections between words or phrases and their corre- 
sponding images (Sadoski & Paivio, 2013). This integration of 
words and images is a key ingredient in generative processing in 
the cognitive theory of multimedia learning; therefore, mental 
imagery can be considered a generative learning strategy (Mayer, 
2009). Similarly, in learning with text and pictures, referential 
connections between words and corresponding pictures are crucial 
for facilitating deeper understanding of the text content, that is, to 
transfer knowledge to new problems (Kester, Kirschner, & van 
Merriënboer, 2005; Mayer, 2009; Mayer, Steinhoff, Bower, & 
Mars, 1995). If creating referential connections is crucial for 
developing a deep understanding of the learning materials and if 
we take into account that referential connections are a key com- 
ponent of the imagination process, then imagery activities should 
improve transfer and retention test performance. 

Furthermore, when students build images of the spatial relation- 
ships that are expressed in the text, these mental imagery activities 
can facilitate mental model building (Johnson-Laird, 1983). This 
spatial form of mental imagery promotes an internal representation 
that preserves topological relations between elements of a system 
and therefore structural equivalence with the referential system 
(Denis, 2008; Denis & Cocude, 1989). On the basis of this internal 
representation, students can derive structural knowledge about the 
major components of the system as well as dynamic knowledge 
about how the system works (Mayer & Gallini, 1990). This is 
consistent with the view that learners can build runnable mental 
models of a dynamic system (Hegarty, 2004). Mental imagery can 
therefore be called a model-focused strategy that should affect the 
students’ ability to transfer their knowledge to new problems. 

To our knowledge, there are no studies that directly test the 
effects of mental imagery in learning from explanative scientific 
text on transfer performance. Leutner, Leopold, and Sumfleth 
(2009) found an interaction between drawing instruction and im- 
agery instruction in terms of imagery instruction facilitating com- 
prehension in the absence of drawing instruction. However, the 
comprehension test required students to draw text-based inferences 
but did not include transfer questions. The results of Leopold, 
Sumfleth, and Leutner (2013) showed that students’ self-reported 
mental imagery activities partly mediated the students’ spatial 
representations about the text content, which in turn mediated 
transfer performance. These studies support the idea that imagery 
affected the students’ spatial representations and deeper under- 
standing, although this idea was not directly tested. 

One problem that may affect the effectiveness of mental imag- 
ery concerns the quality of students’ created images. Denis and 
Cocude (1992) observed that students had difficulties in construct- 
ing accurate mental images from a text that described the spatial 
outline of a fictive island. Denis (2008) related these difficulties to 
two processes—construction and review processes. Constructing 
mental images is a sequential process in which students generate 
images and step by step add one image to the other. By contrast, 
reviewing mental images involves the activation of the whole 
image so that the image can be used for manipulation or compar- 
ison tasks. Although constructing and reviewing are dynamic 
processes that are intertwined, constructing is usually more impor- 
tant in the beginning of a learning phase, while reviewing is more 
important at the end of a learning phase (Denis, 2008). To support 
students in constructing mental images, we presented external 
pictures before the students read and imagined each text paragraph. 
In this sense, the picture provides a scaffold for the imagery 
process (Eitel, Scheiter, Schüler, Nyström, & Holmqvist, 2013). 
To support students in reviewing and updating their mental 
images, we presented external pictures after the students read and 
imagined each text paragraph. The picture provided external feed- 
back for their mental image. 

The theoretical rationale for studying the imagination principle 
is the same as for the multimedia principle, that is, both principles 
are based on the idea that deeper learning occurs when learners 
engage in the act of building connections between corresponding 
words and pictures that describe how a system works. In the case 
of the multimedia principle, the pictures are provided by the 
instructor, but in the case of the imagination principle, the pictures 
are imagined by the learner (with guiding instructions). This inte- 
gration of words and graphics is called generative processing in 
the cognitive theory of multimedia learning and is posited to lead 
to meaningful learning outcomes. 

In the present study, students read an explanative scientific text 
on how the human respiratory system works (as shown in Appen- 
dix A) either with or without prompts to imagine (as shown in 
Appendix B). In addition, for some learners, pictures were pro- 
vided as instructional support for the imagery process by present- 
ing a picture before or after each text paragraph. Based on the 
cognitive theory of multimedia learning, we predicted that students 
who were asked to imagine corresponding graphics as they read an 
explanative science text would score higher on subsequent transfer 
tests than students who simply read the text (Prediction 1). We also 
expected that students who were provided with external pictures to 
support the imagery process would score higher on transfer tests 
than students who simply imagined the text on their own (Predic- 
tion 2). 

Secondary predictions were that students who were asked to 
imagine would also show superior performance on retention of the 
key steps in the explanation and on drawing the key steps in the 
explanation as compared to students who simply read (Predictions 
3 and 5). Finally, we expected that students who were provided 
with external pictures to support the imagery process would score 
higher on retention and drawing tests than students who simply 
imagined the text on their own (Predictions 4 and 6). In addition, 
as a preliminary step, we tested whether the treatment groups 
differed in time on task, self-reported motivation, perceived diffi- 
culty, and mental effort. 


Experiment 1 


Experiment 1 tested these six predictions on an immediate test. 


Method 


Participants and design. The participants were 85 college 
students recruited from the psychology subject pool of the Uni- 
versity of California, Santa Barbara. Their mean age was 19.09 
years (SD = 1.14), and the percentage of female students was 
64.3%. They scored low on a survey of prior knowledge (M = 
3.28, SD = 2.07, based on a 13-point measure), and their mean 
score on a 10-point test of spatial ability was 4.16 (SD = 3.31). 
The study was based on a between-subjects design with four levels 
of imagery instruction (imagery group, picture-before-imagery 
group, picture-after-imagery group, and control group). Twenty 
students served in the imagery group, 22 in the picture-before- 
imagery group, 20 in the picture-after-imagery group, and 23 in the 
control group. 

Materials. The learning materials were computer based and 
consisted of four versions of a lesson on how the human respira- 
tory system works adapted from a shorter lesson used by Mayer 
and Sims (1994). The text contained 786 words and consisted of an 
introduction and nine paragraphs. We computed a readability score 
using the Flesch-Kincaid grade level formula as an indicator of 
text difficulty (Kincaid, Fishburne, Rogers, & Chissom, 1975). 
The readability score of the text was 9.9, which indicates that the 
text was appropriate for students from Grades 10 and higher and 
was thus appropriate for college students. 
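
For readers unfamiliar with the readability index, the following minimal sketch shows how a Flesch-Kincaid grade level can be computed from word, sentence, and syllable counts. The coefficients are those of the published formula; the syllable counter is a crude vowel-group heuristic introduced here as an assumption, so scores from `flesch_kincaid_grade` will only approximate those of dedicated readability tools. All function names are hypothetical.

```python
import re

def fk_grade_from_counts(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Estimate the grade level of a text using the heuristic counts above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade_from_counts(len(words), len(sentences), syllables)

# Hypothetical counts: 100 words, 5 sentences, 150 syllables
print(round(fk_grade_from_counts(100, 5, 150), 2))  # 9.91
```

With these hypothetical counts (20 words per sentence, 1.5 syllables per word) the score falls near Grade 10, which is the range reported for the lesson text.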

The same text was used in all four versions and is reproduced in 
Appendix A. In all four versions, each paragraph was presented on 
a separate screen along with the following headings: (a) Structure 
of the Nervous System, (b) Steps in the Nervous System to Control 
Breathing, (c) Structure of the Thoracic Cavity, (d) Structure of the 
Airway System, (e) Process of Inhaling, (f) Structure of the Ex- 
change System, (g) Structure of the Circulatory System, (h) Pro- 
cess of Exchanging, and (i) Process of Exhaling. Students clicked 
on a next button in order to move from one paragraph to the next 
paragraph. The presentations were developed using Macromedia 
Authorware 7.0. 

The control version of the lesson included just the text para- 
graphs with the next button presented below each paragraph, as 
exemplified in Figure 1. The imagery version was identical to the 
control version except that a specific imagery instruction was 
added to the right of each paragraph, for example, “Please imagine 
the steps in the nervous system when the brain sends a signal to the 
diaphragm and rib muscles.” Figure 2 shows a screenshot from the 
imagery version of the lesson. The imagination instructions for 
each of the nine paragraphs are listed in Appendix B. The picture- 
before-imagery version of the lesson was identical to the imagery 
version except that a drawing was presented before each para- 
graph. The drawing depicted the content of the following para- 
graph. In order to move from the picture to the corresponding 


Structure of the Nervous System 


The respiratory center is located in the 
rear, bottom part of the brain, near the 
back of the neck. The respiratory 
center of the brain is connected to a 
pathway of nerves that leads down 
from the spinal cord to connect with 
muscles controlling the diaphragm and 
the rib cage. 


Figure 1. A screenshot of the program presented to the control group. See 
the online article for the color version of this figure. 


paragraph, the students clicked on the next button. In order to 
move from this paragraph to the next picture, they clicked on the 
next button again, and so on. The students only saw either the 
picture or the paragraph, never both of them at the same time. 
Figure 3 shows a screenshot of a picture shown before a paragraph 
in the lesson. The picture-after-imagery version was identical to 
the picture-before-imagery version except that the corresponding 
picture was presented after the students had read and imagined the 
respective paragraph with the instruction: “Please compare this 
with your mental picture.” Thus, students in the picture-after- 
imagery group first saw the paragraph with the imagery instruc- 
tion, then clicked the next button and saw the corresponding 
picture. When they clicked on the next button again, they moved 
on to the next paragraph, and so on. 

The testing materials consisted of a retention test, a transfer test, 
a drawing test, a paper-folding test, and a questionnaire. The 
testing materials were printed on 8.5-in. × 11-in. sheets of paper. 

The retention test contained the following instruction at the top 
of the sheet: “Using what you learned in the session, please write 
an explanation of how the human respiratory system works.” For 
scoring the students’ explanations, we divided the text into 35 idea 
units based on the paragraphs about the process of respiration and 
41 idea units based on the paragraphs about the structure of the 
Structure of the Nervous System 

The respiratory center is located in the 
rear, bottom part of the brain, near the 
back of the neck. The respiratory 
center of the brain is connected to a 
pathway of nerves that leads down 
from the spinal cord to connect with 
muscles controlling the diaphragm and 
the rib cage. 

Please imagine the structure of the nervous system consisting of the brain, nerves, diaphragm and rib muscles. 

Figure 2. A screenshot of the program presented to the imagery group. See the online article for the color 
version of this figure. 


respiratory system. The headings for the paragraphs on the process 
of respiration were (a) Steps in the Nervous System to Control 
Breathing, (b) Process of Inhaling, (c) Process of Exchanging, and 
(d) Process of Exhaling. We computed a process-retention score 
for each student by counting the number of ideas (out of 35) that 
the student included in his or her explanation. One point was given 
for correctly stating each of the 35 idea units, for example, “brain 
detects the need for oxygen,” “brain sends out a signal to inhale,” 
“signal moves to muscles controlling the diaphragm or rib cage,” 
“the diaphragm contracts downward,” and “the rib cage moves 
slightly outward.” The headings of the paragraphs about the struc- 
ture of the respiratory system were (a) Structure of the Nervous 



System, (b) Structure of the Thoracic Cavity, (c) Structure of the 
Airway System, (d) Structure of the Exchange System, and (e) 
Structure of the Circulatory System. We computed a structure- 
retention score for each student by counting the number of ideas 
(out of 41) that the student included in his or her explanation. One 
point was given for correctly stating each of the 41 idea units, for 
example, “respiratory center is located in rear part of brain,” “from 
brain nerves lead down the spinal cord,” “nerves lead to muscles 
of the diaphragm and rib cage,” “the thoracic cavity contains the 
lungs,” “the thoracic cavity is surrounded by ribs,” “ribs can move 
inward or outward,” and “the diaphragm is on the bottom of the 
thoracic cavity.” The participant did not have to show the exact 





Figure 3. A screenshot of a picture shown to the picture-before-imagery group. See the online article for the 
color version of this figure. 
wording or correct spelling to receive credit for an idea unit but 
had to express the correct idea. All scoring was done by consensus 
between two raters who were blind to the participants’ group. 
Separate scores for process and structure were computed in order 
to determine whether the effects of imagining helped students to 
visualize the static structure of the respiratory system and the 
dynamic functioning of the respiratory system. Interrater reliabili- 
ties based on 25% of the data were r = .93 for the process- 
retention test and r = .79 for the structure-retention test. 

The transfer test consisted of five sheets, each containing a 
question that required the students to apply their knowledge to new 
problems, such as “Although there is oxygen in the lungs, the cells 
in the body do not get enough oxygen to make energy. What could 
have caused this problem?” or “Suppose you are a scientist who is 
trying to improve the human respiratory system for people who 
climb high mountains (where less oxygen is in the air). What could 
be done to make the human respiratory system more effective for 
mountain climbing?” We computed a transfer score for each 
participant by counting the number of acceptable answers across 
the five transfer questions (Cronbach’s α = .66). The reliability of 
α = .66 is acceptable but somewhat lower than expected, which may 
reflect the fact that only five transfer questions were used. 
Interrater reliability based on 25% of the data was r = .98. 
Acceptable answers for the first question were air cannot get into 
the air sacs, capillaries cannot pick up oxygen from the air sacs, the 
arteries are blocked, veins do not take away carbon dioxide, 
the connection between lungs and heart is blocked, blockage in the 
bronchioles, heart does not beat regularly, cells cannot absorb 
oxygen, and so on. Acceptable answers for the second question 
were expand the rib muscles, expand the diaphragm, expand the 
lung’s capacity, change the regulation of the brain’s system, ab- 
sorb more oxygen with every breath, make the exchange system 
more effective, add air sacs in the lungs, add something that can 
bind more oxygen in the blood, add more channels of capillaries or 
arteries, and so on. The transfer test was intended to assess the 
learner’s depth of understanding of the material and therefore is 
the primary learning outcome measure in this study. 
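
Cronbach's α, reported above for the five transfer questions, compares the sum of the item variances with the variance of the total score. The following is a minimal sketch of the standard formula, not the authors' analysis script; the function name and the item scores in the example are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha.

    item_scores: list of k lists, one per item, each holding one score
    per participant (all lists must have the same length).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(item_scores[0])
    if any(len(item) != n for item in item_scores):
        raise ValueError("every item needs a score from every participant")
    # Total score per participant across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    total_variance = variance(totals)
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical scores: five items (rows) answered by six participants (columns).
example_items = [
    [2, 1, 3, 0, 2, 1],
    [1, 1, 2, 0, 3, 1],
    [2, 0, 3, 1, 2, 2],
    [1, 2, 2, 0, 1, 1],
    [3, 1, 2, 1, 2, 0],
]
print(f"Cronbach's alpha: {cronbach_alpha(example_items):.2f}")
```

Because α depends on the number of items k, a short scale such as a five-question test tends to yield lower values than a longer scale with comparable inter-item correlations.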

The drawing test contained the following two instructions, each 
typed on a separate sheet: “Please draw a picture of the exchange 
system and label the different parts.” “Please draw a picture of the 
respiratory system when the person inhales and label the different 
parts.” These instructions referred to the representation of key 
components of the respiratory system explained in the text and 
their spatial relations. The students were informed that sketching 
the important components and their interrelations would be suffi- 
cient rather than drawing aesthetically appealing pictures. The 
accuracy of the exchange drawing was assessed using a checklist 
that consisted of seven criteria based on seven components of the 
exchange system and their spatial location, that is, the lungs, the 
alveoli, the capillaries, oxygen, carbon dioxide, arteries, and veins. 
The accuracy of the inhaling drawing was assessed using a check- 
list based on four criteria, that is, the windpipe-to-lung connection, 
expansion of lungs, flattening of the diaphragm, and expansion of 
ribs. An accurate drawing of each component was given 2 points, 
a partly accurate component was given 1 point, and an unaccept- 
able drawing of a component received 0 points. For example, when 
a student’s drawing showed the diaphragm beneath the lungs and 
the student had indicated (by arrows or by labels) that the shape of 
the diaphragm was flattened, 2 points were given. When it was not 


obvious that the diaphragm was flattened, 1 point was given. When 
the diaphragm was drawn in the incorrect shape or when the 
diaphragm was not mentioned at all, 0 points were given. The 
maximal number of points was 22 for the two drawings (Cronbach’s 
α = .72). Two raters scored 25% of the students’ drawings 
with an interrater reliability of r = .95. The drawing test was 
intended to assess the quality of the student’s self-generated 
visual-spatial representations of the respiratory system. 
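
The checklist scoring described above can be sketched as follows. The criterion names are taken from the text; the rating labels, the function name, and the example ratings are hypothetical and serve only to illustrate how the maximum of 22 points arises from 11 criteria worth up to 2 points each.

```python
EXCHANGE_CRITERIA = (
    "lungs", "alveoli", "capillaries", "oxygen",
    "carbon dioxide", "arteries", "veins",
)
INHALING_CRITERIA = (
    "windpipe-to-lung connection", "expansion of lungs",
    "flattening of the diaphragm", "expansion of ribs",
)
# Hypothetical rating labels mapped to the 2/1/0 point scheme.
POINTS = {"accurate": 2, "partly accurate": 1, "unacceptable": 0}
MAX_DRAWING_SCORE = 2 * (len(EXCHANGE_CRITERIA) + len(INHALING_CRITERIA))


def score_drawing(ratings, criteria):
    """Sum the points for one drawing; a component that was not drawn scores 0."""
    return sum(POINTS[ratings.get(name, "unacceptable")] for name in criteria)


# Hypothetical ratings for one student's inhaling drawing.
inhaling_ratings = {
    "windpipe-to-lung connection": "accurate",
    "expansion of lungs": "accurate",
    "flattening of the diaphragm": "partly accurate",
}
print(score_drawing(inhaling_ratings, INHALING_CRITERIA), "of", 2 * len(INHALING_CRITERIA))
```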

The paper-folding test consisted of a sheet with 10 problems 
taken from Ekstrom, French, Harman, and Dermen (1976). Each 
item required imagining a paper being folded, punched with a hole, 
and reopened. One point was given for each correct answer, and 
one point was subtracted for each wrong answer, with a total 
possible score of 10. The paper-folding test was intended to 
measure an aspect of spatial ability that has been found to be 
related to mental imagery (Denis, 2008). 

The first sheet of the questionnaire consisted of seven self-report 
scales on the students’ motivation, their effort/difficulty, and strat- 
egy use. The motivation questions were “I enjoyed learning from 
this lesson,” “I would like to learn from more lessons like this,” 
and “Please rate how appealing this lesson was for you.” Each was 
accompanied by a 5-point scale ranging from strongly agree to 
strongly disagree (Cronbach’s α = .84). The effort and difficulty 
questions were “Please rate how difficult this lesson was for you” 
(with a 5-point scale from very easy to very difficult) and “Please 
rate how much effort you exerted in learning this lesson” (with a 
5-point scale ranging from very low to very high). The final two 
questions were “Please rate your spatial ability” (with a 5-point 
scale from very low to very high) and “I prefer to learn visually” 
(with a 5-point scale from strongly agree to strongly disagree). 

The second sheet consisted of a four-item self-report question- 
naire on mental imagery and six distractor items. The four items 
were translated from a version used by Leopold et al. (2013) and 
were used to check whether students followed the instructions 
during the study phase (with Cronbach’s α = .85). All self-report 
scales were 5-point scales ranging from 1 (strongly disagree) to 5 
(strongly agree). The items were “I mentally imagined how the 
processes described in the text work,” “I formed mental pictures 
about the text content in order to understand it,” “I created mental 
pictures about the text content,” and “I tried to understand the 
structures and processes of the respiratory system by mental im- 
agery.” 

The final sheet of the questionnaire included questions concern- 
ing the students’ age, gender, and prior knowledge about the 
human body. Their knowledge of the human body was measured 
by using a 13-item checklist. Students were asked to “place a 
check mark next to the things that apply to you,” based on the 
following list: “I have participated in science programs or fairs,” 
“Biology was my favorite subject in high school,” “I sometimes 
watch science documentaries about anatomy in my free time,” “I 
can name most of the components of the human heart from 
memory,” “I have taken a course in human anatomy or physiol- 
ogy,” “I attended a course on cardiopulmonary resuscitation (CPR) 
training,” “I can explain what pulmonary embolism means,” “I 
sometimes find myself on the Internet looking up biology related 
topics,” “I know the difference between venous and arterial 
blood,” “I have watched an educational video on how the respi- 
ratory system works,” “I talked to a doctor about how the process 
of respiration works,” “I know the definition of the terms 
‘diastolic’ and ‘systolic’,” and “I took advanced biology classes in 
high school (AP, IB, Honors), etc.” A prior knowledge score was 
computed by tallying the number of items checked on the check- 
list, yielding a maximum score of 13. 

Procedure. Students were randomly assigned to the experi- 
mental groups and were tested in groups of one to three per 
session. Each student was seated at an individual cubicle in front 
of a computer. First, participants signed an informed-consent 
sheet. Then, the experimenter presented oral instructions stating 
that the students would receive a lesson on the human respiratory 
system that would comprise an introduction and nine paragraphs. 
They were told to go at their own pace and that when they were 
finished, the experimenter would have some questions for them to 
answer. Students pressed the space bar and then saw a page that 
welcomed them and thanked them for their participation in the 
experiment. This page repeated the instructions that they would 
read an introduction and nine paragraphs on the respiratory system 
and should click on the next button in order to move through the 
presentation. Furthermore, students in the three imagery groups 
were informed that first they would receive a short pretraining in 
how to use their mental imagination. This pretraining consisted of 
examples of how to imagine text content based on two text 
paragraphs about the global warming effect. The text in each 
paragraph was presented sentence by sentence. After reading the 
first sentence, students pressed the space bar and then saw how a 
possible image of that sentence would look. When they pressed the 
space bar again, the next sentence was presented, and after press- 
ing the space bar again, the image was adapted to the content of the 
new sentence, and so on. On average students spent 95.87 s (SD = 
39.62) on this pretraining of how to form mental images. 

After the pretraining, the imagery groups were presented with a 
page that told them that now they would receive the lesson on the 
human respiratory system and that they were asked to imagine the 
text as a graphic. The students in the picture-before-imagery group 
were informed that in order to help them to imagine the informa- 
tion, they would be shown a drawing before each paragraph. The 
students in the picture-after-imagery group were informed that in 
order to help them, they would be shown a drawing after they had 
imagined the paragraph so that they could compare their mental 
picture with the presented drawing. The students then studied the 
respective version of the presentation on the respiratory system 
according to their experimental group. 

When the presentation was finished, the experimenter presented 
instructions for the paper-folding test, and then, the students were 
given the paper-folding test for 3 minutes. After 3 minutes, the 
experimenter collected the paper-folding test and distributed the re- 
tention sheet, which asked the students to write an explanation of how 
the human respiratory system works. After 5 minutes, the retention 
sheet was collected, and the transfer sheets were presented one at a 
time. Students were given 2.5 minutes for each transfer question. Each 
transfer sheet was collected by the experimenter before the next sheet 
was presented. Afterward, the two sheets of the drawing test were 
distributed one at a time. Students were asked (a) to draw a picture of 
the exchange system and label the different parts and (b) to draw a 
picture of the respiratory system when the person inhales and to label 
the different parts. Students were told that their picture did not have to 
be beautiful but could be simple and should depict the important parts. 
Two and a half minutes were given for each of the drawing tasks. 
Then, the experimenter distributed each sheet of the questionnaire, 


with instructions to answer as honestly as possible and to complete the 
questionnaire at their own rate. Upon completion, participants were 
thanked and excused. We followed guidelines for ethical treatment of 
human subjects. 


Results and Discussion 


We computed analyses of variance to test overall differences 
among the experimental groups. To test predictions requiring a 
comparison of the three treatment means with the mean of the 
control group, we used Dunnett tests. The Dunnett test controls the 
familywise Type I error rate at 5% and is at the same time a 
powerful test specifically designed to compare each treatment with 
a control (Klockars & Sax, 1986; Sheskin, 2011). To test predic- 
tions requiring comparisons between the picture-imagery groups 
and the imagery group, we used Bonferroni’s correction as it is 
suitable for a small number of comparisons and can be transferred 
to nonorthogonal comparisons. 
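
For a small set of planned comparisons, Bonferroni's correction can be applied by multiplying each unadjusted p value by the number of comparisons and capping the result at 1. The sketch below illustrates that rule only; the function name and the unadjusted p values are hypothetical and are not taken from the study.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values for a family of comparisons."""
    m = len(p_values)
    # Multiply by the number of comparisons and cap at 1.
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p values for two comparisons against one reference group.
unadjusted = [0.03, 0.20]
for raw, adjusted in zip(unadjusted, bonferroni_adjust(unadjusted)):
    print(f"unadjusted p = {raw:.3f}, adjusted p = {adjusted:.3f}")
```

Equivalently, each unadjusted p value can be compared with .05 divided by the number of comparisons.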

Before testing the effects of the imagery instructions on the 
dependent variables, we examined whether the four treatment 
groups were equivalent on basic characteristics and whether the 
groups followed their particular instruction. 

Are the groups equivalent on basic characteristics? 
Analyses of variance (with p < .05) showed that the groups did not 
differ on prior knowledge, spatial ability, age, self-rated 
spatial ability, or preference for learning visually. A chi- 
square analysis revealed a difference in the proportion of 
males and females.¹ 

Did the students follow the instructions? In the self-report 
questionnaire, we asked the students whether they really had 
imagined the text content. The top line of Table 1 shows the mean 
imagery score (and standard deviation) for each group, with higher 
scores indicating higher degrees of imagery during learning. We 
used these data as a manipulation check. An analysis of variance 
revealed a significant effect of treatment, F(3, 81) = 4.25, MSE = 
.47, p = .008, η² = .14. Dunnett tests showed that the picture- 
before-imagery group (p = .053), the imagery group (p = .007), 
and the picture-after-imagery group (p = .012) each reported more 
mental imagery activity during learning than the control group did. 
We take this as evidence that the imagination prompts were 
successful in promoting imagery during learning from the science 
text. Table 1 shows that the control group spontaneously reported 
imagining to a substantial degree, which can be explained by the 
fact that the text used concrete language, which has been shown to 
evoke mental images (see the review of Sadoski & Paivio, 2013). 
Our results indicate that explicit strategy instruction enhanced this 
effect. 

Does imagery instruction facilitate transfer performance 
(Prediction 1)? The primary research question addressed in this 
study concerns whether students learn more deeply from a science 
text when they are prompted to form mental images of the respi- 
ratory system and how it works as they read. Based on the 
imagination hypothesis, students who form mental images of the 


¹ We computed analyses of covariance with gender as a covariate for all 
of the performance measures. The results did not change except for the 
main effect of treatment on process-retention scores. Although gender was 
not a significant predictor, F(3, 80) = 1.54, p = .218, the main effect of 
treatment did not remain significant (p = .109). 
Table 1 
Experiment 1: Means, Standard Deviations, and Effect Sizes for Self-Reported Mental Imagery, Motivation, Perceived Difficulty, and Mental Effort 

                                                    Experimental group 
                       Control        Imagery                            Picture-imagery                          Imagery-picture 
Self-report ratings    M      SD      M            SD     d              M            SD     d              M      SD     d 
Mental imagery         3.68   0.89    4.34ᵃ        0.55   0.92           4.17ᵃ        0.47   0.72           4.30ᵃ  0.74   0.75 
Motivation             3.03   0.92    [illegible]  0.82   0.21           [illegible]  0.62   0.87           3.55   0.95   0.53 
Perceived difficulty   3.30   0.93    3.00         0.92   [illegible]    [illegible]  0.95   [illegible]    2.85   0.88   −0.50 
Mental effort          3.00   0.80    3.10         0.72   0.13           3.05         0.90   0.06           2.85   1.09   −0.16 

ᵃ Indicates significant difference from the control group. 


system described in a science text should understand the material 
more deeply—through building connections between correspond- 
ing words and images—and therefore perform better on tests of 
problem-solving transfer. The top row of Table 2 presents the 
means (and standard deviations) of each group on the transfer test. 
An analysis of variance conducted on these data demonstrated a 
significant effect of treatment, F(3, 81) = 4.70, MSE = 15.30, p = 
.004, η² = .15. Dunnett tests showed that the imagery group 
outperformed the control group (p = .001), and there was no 
significant difference between the picture-before-imagery group 
and the control group (p = .098) or the picture-after-imagery 
group and the control group (p = .180). In line with our prediction, 
the superiority of the imagery group over the control group is new 
evidence for an imagination effect in which students learn more 
deeply when they are asked to form mental images of an explan- 
ative scientific text. This finding is a primary contribution of this 
study. 

Do the imagery groups that received drawings show better trans- 
fer performance than the pure imagery group (Prediction 2)? The 
means in the top row of Table 2 seem to indicate that the groups 
that received drawings and imagery prompts (i.e., picture-before- 
imagery and picture-after-imagery groups) showed lower scores in 
their transfer performance than the imagery group did. Planned 
comparisons revealed that the mean transfer score of the imag- 
ery group did not differ significantly from the picture-before- 
imagery group, t(81) = 1.66, p = .200, or from the picture- 
after-imagery group, t(81) = 1.86, p = .134. As can be seen and 
contrary to our predictions, providing drawings of the human 
respiratory system (in the picture-before-imagery group or 


Table 2 


picture-after-imagery group) does not add to the effectiveness 
of imagination prompts, perhaps because the students did not 
have to work as hard to mentally construct their illustrations. 

Does imagery instruction facilitate retention performance 
(Prediction 3)? The foregoing analysis shows an imagination 
effect in which asking students to mentally create drawings for a 
science text results in improvements in transfer test performance, 
indicating deeper learning. A second research question concerns 
whether imagination prompts also help students better remember 
the structure and process of the system described in a scientific 
text. The second row of Table 2 shows each group’s mean reten- 
tion score (and standard deviation) for text describing the process 
of respiration, whereas the third row shows the mean retention 
scores (and standard deviations) for text describing the structure of 
the respiratory system. According to the imagination hypothesis, 
instructions to form mental images for the content of a scientific 
text should result in better memory for the process and structure of 
the system described in the text. 

With regard to process-retention scores shown in the second line 
of Table 2, an analysis of variance revealed a significant overall 
effect of treatment, F(3, 81) = 3.35, MSE = 13.06, p = .023, η² = 
.11. To examine whether the three experimental groups that 
received an imagery instruction performed better than the control 
group did, we computed Dunnett tests. The results showed that the 
picture-before-imagery group (p = .042), the imagery group (p = 
.052), and the picture-after-imagery group (p = .02) each per- 
formed better than the control group did. With regard to structure- 
retention scores shown in the third line of Table 2, an analysis of 
variance revealed a significant overall effect of treatment, F(3, 


Experiment 1: Means, Standard Deviations, and Effect Sizes for the Learning Outcome Variables 

                                                          Experimental group 
                           Control                Imagery                            Picture-imagery                   Imagery-picture 
Learning outcome scores    M            SD        M            SD           d        M            SD       d        M            SD       d 
Transfer test              6.13         3.14      10.60ᵃ       [illegible]  1.30     8.59         4.67     0.63     8.30         3.99     0.61 
Process retention          6.57         3.63      9.20ᵃ        3.50         0.74     9.23ᵃ        4.04     0.69     9.60ᵃ        3.19     0.89 
Structure retention        [illegible]  1.80      [illegible]  2.76         0.87     [illegible]  2.18     −0.20    [illegible]  1.73     −0.01 
Drawing test               8.09         4.23      11.30ᵃ       3.77         0.80     13.06        4.31     1.16     11.80ᵃ       4.76     0.83 
Study time (in seconds)    284.93       69.93     351.62       103.98       0.77     402.15ᵃ      170.30   0.98     406.91ᵃ      148.65   1.12 

ᵃ Indicates significant difference from the control group. 
81) = 5.14, MSE = 4.60, p = .003, η² = .16. Dunnett tests 
showed that the imagery group significantly outperformed the 
control group (p = .011), and there was no significant difference 
between the picture-before-imagery group and the control group 
(p = .873) or the picture-after-imagery group and the control 
group (p = .999). The mean scores in the structure-retention test 
were quite low and indicate a floor effect. The lower scores can be 
explained by the fact that the students were asked to write an 
explanation “of how the human respiratory system works.” This 
task did not require the students to write down structure informa- 
tion but focused on process information. 
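
The process-retention statistics can be roughly checked from the group sizes, means, and standard deviations alone. The sketch below computes a one-way analysis of variance and a pooled-SD Cohen's d from summary statistics. The input values are the process-retention entries of Table 2 and the group sizes given in the Method section; the function names are hypothetical, and the pooled-SD formula for d is an assumption, since this section does not state how d was computed. Because the reported means and standard deviations are rounded, the output should land close to, but not exactly on, the reported F(3, 81) = 3.35, MSE = 13.06, and d = 0.74.

```python
from math import sqrt


def anova_from_summary(ns, means, sds):
    """One-way ANOVA from group sizes, means, and standard deviations.

    Returns (F, MSE, df_between, df_within).
    """
    total_n = sum(ns)
    k = len(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    df_between, df_within = k - 1, total_n - k
    mse = ss_within / df_within
    f_value = (ss_between / df_between) / mse
    return f_value, mse, df_between, df_within


def cohens_d(m_control, sd_control, n_control, m_treat, sd_treat, n_treat):
    """Cohen's d using the pooled standard deviation (an assumed formula)."""
    pooled_var = ((n_control - 1) * sd_control ** 2 + (n_treat - 1) * sd_treat ** 2) / (
        n_control + n_treat - 2
    )
    return (m_treat - m_control) / sqrt(pooled_var)


# Order: control, imagery, picture-before-imagery, picture-after-imagery.
ns = [23, 20, 22, 20]
means = [6.57, 9.20, 9.23, 9.60]
sds = [3.63, 3.50, 4.04, 3.19]

f_value, mse, df_b, df_w = anova_from_summary(ns, means, sds)
d_imagery = cohens_d(means[0], sds[0], ns[0], means[1], sds[1], ns[1])
print(f"F({df_b}, {df_w}) = {f_value:.2f}, MSE = {mse:.2f}")
print(f"d (imagery vs. control) = {d_imagery:.2f}")
```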

Overall, the reported results provide additional support for the 
imagination hypothesis, in which students learn better when they 
form mental images about the structure and process of the system 
described in a science text as they read. One limitation of the 
results is that with gender as a covariate, the overall effect on 
process-retention scores did not remain significant. 

Do the imagery groups who received drawings show better 
retention performance than the imagery group (Prediction 4)? 
We compared the picture-before-imagery group and the picture- 
after-imagery group with the imagery group, respectively. With 
regard to the process-retention score, neither the picture-before- 
imagery group nor the picture-after-imagery group significantly 
differed from the pure imagery group, t(81) < 1. With regard to 
the structure-retention score, the imagery group outperformed the 
picture-before-imagery group as well as the picture-after-imagery 
group, t(81) = 3.59, p = .002, and t(81) = 2.95, p = .008, 
respectively. Overall, contrary to our predictions, there is no indi- 
cation that adding pictures enhances the effectiveness of imagina- 
tion prompts. 

Does imagery instruction facilitate drawing performance 
(Prediction 5)? The imagination hypothesis predicts that asking 
students to form images during learning will improve their perfor- 
mance on a drawing test. The fourth line in Table 2 shows the 
mean drawing score (and standard deviation) for each group. An 
analysis of variance conducted on the data summarized in the 
fourth line of Table 2 demonstrated a significant effect of treat- 
ment, F(3, 81) = 5.48, MSE = 18.35, p = .002, η² = .17. 
Consistent with predictions, Dunnett tests showed that the picture- 
before-imagery group (p = .001), imagery group (p = .044), and 
picture-after-imagery group (p = .016) performed better than the 
control group did. 

Do the imagery groups that received pictures show better 
drawing performance than the imagery group (Prediction 6)? 
Neither the picture-before-imagery group nor the picture-after- 
imagery group performed significantly better than the imagery 
group, t(81) = 1.32, p = .382, and t(81) < 1, respectively. 
Apparently, actually seeing a picture and simply imagining a picture 
based on a science text produced equivalent improvements on a 
subsequent drawing test. Contrary to our predictions but compa- 
rable to the results of the transfer and retention tests, adding 
pictures did not increase drawing performance. 

Do the groups differ in time on task? The mean study times 
and standard deviations for each group are shown in Table 2. There 
was a significant difference among the treatment groups in study 
time, F(3, 81) = 4.31, p = .007, η² = .14. Dunnett tests showed 
that the picture-before-imagery group (p = .009) and the picture- 
after-imagery group (p = .008) spent more time with the presen- 
tation than the control group did, but the imagery group did not 


differ significantly from the control group (p = .226). To take into 
account the difference in study time, we computed analyses of 
covariance (ANCOVAs) with study time as a covariate and the 
performance measures as dependent variables. With study time as 
the covariate, the results did not differ from the former analysis for the 
transfer score, F(3, 80) = 5.08, MSE = 15.27, p = .003, η² = .16; 
for the structure-retention score, F(3, 80) = 5.28, MSE = 4.61, 
p = .002, η² = .17; and for the drawing score, F(3, 80) = 5.06, 
MSE = 18.54, p = .003, η² = .17. However, the overall effect of 
the treatment on the process-retention score did not remain 
significant, F(3, 80) = 2.65, p = .055. The covariate time was not 
significant in any of the analyses (p > .149). There was also 
no significant interaction between the treatment and time on any of 
the performance measures (p > .200). 

Do the students differ in their motivation, perceived diffi- 
culty, and mental effort scores? Table 1 summarizes the mean 
motivation rating and mental effort scores (and standard devia- 
tions) for each group. With regard to motivation, there was a 
significant effect for treatment, F(3, 81) = 2.88, MSE = .70, p = 
.041, η² = .10. Dunnett tests indicated that the picture-before- 
imagery group reported more enjoyment with the lesson than the 
control group did (p = .024), but the picture-after-imagery group 
and the imagery group did not differ from the control group (p = 
.131 and p = .809, respectively). With regard to the perceived 
difficulty of the lesson and the mental effort invested in studying 
the lesson, there were no differences among the experimental 
groups, Fs(3, 81) < 1. Overall, there is no strong evidence that 
students who received imagery prompts reported more difficulty or 
effort than the control group. Only the picture-before-imagery 
group reported more motivation than the control group did. How- 
ever, self-report measures may not be the best way to assess these 
factors as they can be influenced by a tendency to choose 
socially approved behaviors, student ability, instruction, con- 
text of assessment, and so on (e.g., Kruger & Dunning, 1999). 


Behavioral measures such as the students’ persistence in studying 
further text passages, physiological measures, or dual task 
paradigms might provide more reliable data (Brünken, Seufert, 
& Paas, 2010; DeLeeuw & Mayer, 2008). 

How do the four learning outcome measures relate to one 
another? Table 3 shows a correlation matrix for the four learn- 
ing outcome measures, with significant correlations indicated in 
bold font. As can be seen, the transfer score correlates significantly 
with each of the other three scores, the drawing score correlates 
significantly with two other scores, the process-retention score 
correlates significantly with two other scores, and the structure- 


Table 3 
Correlations Among Transfer, Process-Retention, Structure-Retention, and Drawing Scores 

Learning outcome scores    1      2              3      4 
1. Transfer                —      [illegible]    .32    .52 
2. Process retention              —              .06    .53 
3. Structure retention                           —      [illegible] 
4. Drawing                                              — 

Note. Significant correlations are indicated in bold. 
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retention score correlates significantly with one other score. Over- 
all, the transfer score appears to be the most inclusive, suggesting 
that it may be the best measure of deep learning. Consistent with 
results of other studies (Leopold et al., 2013; Schwamborn, Mayer, 
Thillmann, Leopold, & Leutner, 2010), there is a high correlation 
between the drawing scores and the transfer and process-retention 
scores (r = .52, r = .53, respectively), indicating close connec- 
tions between the quality of the spatial representation of the 
respiratory system and measures of deep learning. 

Does drawing performance mediate the effect of imagination 
activity on transfer performance? The results reported above 
showed that the imagery group performed better on tests of 
problem-solving transfer than the control group. We expected that 
the effect of condition (control vs. imagery) on transfer perfor- 
mance would be mediated by the quality of the students’ internal 
spatial representations of the respiration process assessed by their 
drawing performance. This mediation hypothesis is based on the 
idea that mental imagery instruction promotes an internal spatial 
representation that preserves structural equivalence with the refer- 
ential system and therefore facilitates transfer. Following the pro- 
cedure proposed by Baron and Kenny (1986), we performed sim- 
ple and multiple regression analyses. First, we computed the direct 
effect of the independent variable condition (code 1 = control, 
code 2 = mental imagery) on the dependent variable (transfer 
performance): β = .56, p < .001 (see Figure 4). Second, we tested
whether the independent variable (condition) affects the mediating
variable (spatial representation): β = .38, p = .013; third, we
tested whether the mediating variable (spatial representation) af-
fects transfer performance: β = .52, p < .001. Multiple regression
analysis revealed that the direct effect of condition on transfer
performance was reduced when the effect of the mediating vari-
able was controlled: β = .38, p = .003. To test whether the
indirect effect, that is, the path from the independent variable
condition via the mediating variable spatial representation, on
transfer test performance is significant, we conducted the Sobel
test (Sobel, 1982; see also MacKinnon, Lockwood, Hoffman,
West, & Sheets, 2002). The indirect effect (β = .38 × .52 = .20)
was significant (z = 2.16, p = .031). Thus, condition (i.e., imagery
instructions) influenced transfer performance by affecting the stu- 
dents’ spatial representations, which in turn affected their transfer 
performance. These results support partial mediation of the effect 
of condition on transfer performance via the quality of the stu- 
dents’ spatial representations of the respiration process. 
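
The arithmetic behind the Sobel test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis script: the function names are invented here, and the standard errors passed in the final call are hypothetical placeholders because the article does not report them. The sketch uses the first-order Sobel (1982) standard error and checks the two quantities that can be recovered from the text, namely the product of the two reported paths and the two-tailed p value implied by the reported z.

```python
import math


def two_tailed_p(z):
    """Two-tailed p value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def sobel_test(a, se_a, b, se_b):
    """First-order Sobel test of the indirect effect a * b.

    a, se_a: coefficient and standard error, condition -> mediator
    b, se_b: coefficient and standard error, mediator -> outcome
    """
    indirect = a * b
    se_indirect = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    z = indirect / se_indirect
    return indirect, z, two_tailed_p(z)


# Values reported in the text: paths of .38 and .52, and z = 2.16.
indirect_reported = round(0.38 * 0.52, 2)
p_reported = round(two_tailed_p(2.16), 3)
print(indirect_reported, p_reported)

# HYPOTHETICAL standard errors (0.15 and 0.12), for illustration only;
# they are not taken from the article.
indirect, z, p = sobel_test(a=0.38, se_a=0.15, b=0.52, se_b=0.12)
print(round(indirect, 4), round(z, 2), round(p, 3))
```

The Sobel test is usually computed from unstandardized coefficients and their standard errors, so the last call should be read as a template rather than a reproduction of the reported z.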


Figure 4. Mediation model in Experiment 1. * p < .05. ** p < .01. *** p < .001. See the online article for the color version of this figure.


Experiment 2


In Experiment 1, the main result was that mental imagery
prompts facilitate transfer performance. In Experiment 2, we fo-
cused on examining whether the imagination effect is stable over
a time delay. Therefore, we included only an imagery group and a
control group. According to multimedia theory and dual coding
theory, when learners build connections between text and a mental 
image, they construct a deeper learning outcome with more re- 
trieval routes, which should enhance performance over a time 
delay. Previous research on learning strategies further suggests that 
a particularly useful measure of the effectiveness of learning 
strategies involves performance on a delayed test of retention or 
transfer (Dunlosky et al., 2013), so in the interests of consistency, 
we sought to determine whether the findings of Experiment 1 
could be replicated after a 2-day delay. 


Method 


Participants and design. The participants were 48 college 
students recruited from the paid psychology subject pool at the 
University of California, Santa Barbara. Their mean age was 19.73 
years (SD = 1.38), and the percentage of female students was 
71.1%. They scored low on a survey of prior knowledge (M = 
3.04, SD = 2.62, based on a 13-point measure), and their mean 
score on a 10-point test of spatial ability was 5.42 (SD = 3.59). 
The study was based on a between-subjects design with two of the 
experimental conditions used in Experiment 1: imagery group and 
control group. Twenty-three students served in the imagery group, 
and 25 served in the control group. These conditions were identical 
to the ones used in Experiment 1 except that we did not immedi- 
ately test their retention, transfer, and drawing performance but 
rather tested them after a delay of 2 days. Two students did not 
return for the second part of the study, so we excluded their data. 

Materials. The learning and testing materials were identical to 
the ones used in Experiment 1 except that two experimental 
conditions instead of four were implemented. The reliabilities
(Cronbach’s alpha) of the scales were very similar to those
reported in Experiment 1: α = .72 for the transfer test, α = .75 for
the drawing test, α = .91 for self-reported imagery, and α = .83
for self-reported motivation. 

Procedure. The procedure was identical to that used in Ex- 
periment 1 except that the study consisted of two parts. In the first 
part, the students studied the presentation on how the human 
respiratory system works and filled in the paper-folding test. Then, 
the students were dismissed and asked to come back after 2 days. 
Students were told not to study anything about the human respi- 
ratory system during the 2-day delay. In the second part of the 
study, the students completed the retention test, the transfer test, 
the drawing test, and the questionnaire as in Experiment 1. The 
tests were scored with the same procedures used in Experiment 1. 


Results and Discussion 


Are the groups equivalent on basic characteristics? 
Analyses of variance or chi-square tests (with p < .05) showed that 
the groups did not differ on prior knowledge score, spatial ability, 
self-rated spatial ability, preference for learning visually, mean 
age, or proportion of males and females. 

Did the students follow the instructions? The top line of 
Table 4 shows the mean imagery score (and standard deviation) for 
both groups. The analysis of the students’ answers on the ques- 
tionnaire reveals that the students in the imagery group reported 
more mental imagery activity than the students in the control group 
did, t(46) = 3.55, p = .001, d = 1.08. We take this as evidence that
the imagery group followed the imagery prompts during learning
from the science text.


Table 4
Experiment 2: Means, Standard Deviations, and Effect Sizes for Self-Reported Mental Imagery, Motivation, Perceived Difficulty, and Mental Effort

                          Control                 Imagery
Self-report ratings       M             SD        M        SD      d
Mental imagery            3.37          1.11      4.29*    0.60    1.08
Motivation                [illegible]   0.96      3.14     0.84    0.66
Perceived difficulty      3.32          1.18      3.52     0.85    0.20
Mental effort             [illegible]   0.99      3.22     1.00    0.47

* Indicates significant difference from the control group.

Does imagery instruction facilitate transfer performance 
(Prediction 1)? We hypothesized that students who form mental 
images of the system described in the science text should under- 
stand the material more deeply—and therefore perform better on 
tests of problem-solving transfer. The top row of Table 5 presents 
the means (and standard deviations) of the two groups on the 
transfer test. A t test revealed a significant difference between the
groups, t(46) = 2.93, p = .005, d = .86, with the imagery group
scoring higher than the control group. Thus, the imagery prompts 
facilitated problem-solving transfer compared to a control condi- 
tion, even after a delay of 2 days. 
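
The effect sizes reported for Experiment 2 can be checked against the group means and standard deviations. The article does not state which standardizer was used for d. The sketch below assumes the mean difference is divided by the simple average of the two group standard deviations, which happens to reproduce several reported values (a pooled-variance d comes out slightly lower). The function name is an assumption made for this illustration.

```python
def d_average_sd(m_treatment, sd_treatment, m_control, sd_control):
    """Cohen's d using the simple average of the two group SDs.

    ASSUMPTION: the article does not say which standardizer it used;
    this variant is chosen only because it matches the reported values.
    """
    return (m_treatment - m_control) / ((sd_treatment + sd_control) / 2)


# Means and SDs as reported for Experiment 2 (imagery vs. control).
d_transfer = round(d_average_sd(10.22, 4.99, 6.64, 3.38), 2)
d_study_time = round(d_average_sd(319.97, 122.58, 236.40, 64.19), 2)
d_imagery_rating = round(d_average_sd(4.29, 0.60, 3.37, 1.11), 2)

print(d_transfer, d_study_time, d_imagery_rating)
```

Under this assumption the three results agree with the reported d = .86 for transfer, d = .89 for study time, and d = 1.08 for self-reported mental imagery.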

Does imagery instruction facilitate retention performance 
(Prediction 3)? The second and third rows of Table 5 show the 
mean retention scores (and standard deviations) of the two groups. 
There was a significant effect of treatment on the process-retention 
score, t(46) = 3.40, p = .001, d = .98, and a marginal but
nonsignificant effect on the structure-retention score, t(46) = 1.73, p =
.092, d = .50, with the imagery group scoring higher than the 
control group. Thus, imagery prompts facilitated retention of pro- 
cess information compared to a control condition even after a 
delay of 2 days. 

Does imagery instruction facilitate drawing performance 
(Prediction 5)? The fourth line in Table 5 shows the mean 
drawing score (and standard deviation) for each group. There was 
a significant effect of treatment on the drawing score, t(46) = 3.27,
p = .002, d = .95, with the imagery group performing better than
the control group.


Table 5
Experiment 2: Means, Standard Deviations, and Effect Sizes for the Learning Outcome Variables

                            Control             Imagery
Learning outcome scores     M         SD        M             SD        d
Transfer test               6.64      3.38      10.22*        4.99      0.86
Process retention           5.40      3.29      [illegible]   4.29      0.98
Structure retention         1.52      2.12      2.65          2.42      0.50
Drawing test                7.36      4.53      [illegible]   3.95      0.95
Study time (in seconds)     236.40    64.19     319.97*       122.58    0.89

* Indicates significant difference from the control group.

Do the groups differ in time on task? Table 5 shows the 
mean study times and standard deviations for the two groups. 
There was a significant difference between the groups in study 
time, t(46) = 2.99, p = .004, d = .89, in which the imagery group 
spent longer with the presentation than the control group did. 
Therefore, we included study time as a covariate and computed 
ANCOVAs with the performance measures as dependent vari- 
ables. For the transfer score, F(1, 45) = 5.83, MSE = 18.11, p = 
.020; process-retention score, F(1, 45) = 9.39, MSE = 14.77, p = 
.004; and drawing score, F(1, 45) = 7.20, p = .010, the results did 
not change; for the structure-retention score, there no longer was a 
marginally significant effect when study time was included as a 
covariate, F(1, 45) = 2.64, p = .111. The covariate study time was
not significant in any of the analyses. There was also no
significant interaction between the treatment and time on any of 
the performance measures (all Fs < 1). 

Do the students differ in their motivation and mental effort 
scores? Table 4 shows the mean ratings (and standard devia- 
tions) on motivation, perceived difficulty, and mental effort for the 
two groups. With regard to motivation, there was a significant 
difference between the groups, t(46) = 2.29, p = .026, d = .66, in 
which the imagery group reported more enjoyment with the lesson 
than the control group did. This result is in line with Sadoski and 
Quast (1990; see also Sadoski & Paivio, 2013), who reported 
associations between affective factors and mental imagery activity. 

With regard to perceived difficulty and mental effort, there were 
no significant differences between the groups, t(46) < 1, and
t(45) = 1.61, p = .114, respectively. These findings are consistent with Exper-
iment 1.

How do the four learning outcome measures relate to one 
another? Table 6 shows a correlation matrix for the four learn- 
ing outcome measures, with significant correlations indicated in 
bold font. As in Experiment 1, the transfer score correlates
significantly with each of the other three scores. The same holds
for the drawing score, while the process-retention and structure-
retention scores each correlate significantly with two other scores.
There are strong correlations between the drawing score and the 
transfer score, as was found in Experiment 1. Overall, these results 
are very consistent with those of Experiment 1. 

Does drawing performance mediate the effect of imagination 
activity on transfer performance? As in Experiment 1, we 
expected that the effect of condition (control vs. imagery) on 
transfer performance would be mediated by the quality of the 
students’ internal spatial representations of the respiration process. 


Table 6
Experiment 2: Correlations Among Transfer, Process-Retention, Structure-Retention, and Drawing Scores

Learning outcome scores      1      2      3              4
1. Transfer                  —      .63    .40            .59
2. Process retention                —      [illegible]    .53
3. Structure retention                     —              .51
4. Drawing                                                —

Note. Significant correlations are indicated in bold.

Simple regression analyses demonstrated that condition (indepen-
dent variable) significantly affected transfer performance (β = .40,
p < .003) and the mediating variable spatial representation (β =
.44, p = .002; see Figure 5). Also, the mediating variable spatial
representation predicted transfer performance (β = .59, p < .001).
Furthermore, multiple regression analysis revealed that the direct
effect of condition on transfer performance was no longer signif-
icant when the effect of the mediating variable was controlled
(β = .17, p = .199), whereas the effect of the mediating variable
spatial representation was significant (β = .52, p < .001). These
results indicate full mediation of the effect of condition on transfer
performance via students’ spatial representations.
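
The decision logic of the Baron and Kenny (1986) procedure, as applied in the two experiments, can be written out as a small function. This is a simplified sketch rather than the authors' code: the function name, the argument names, and the fixed alpha of .05 are assumptions, and p values reported only as "p < .001" are entered as the upper bound .001.

```python
def classify_mediation(beta_total, p_total, p_a, p_b,
                       beta_direct, p_direct, alpha=0.05):
    """Classify a mediation result from the Baron and Kenny steps.

    beta_total, p_total: effect of condition on the outcome
    p_a: effect of condition on the mediator
    p_b: effect of the mediator on the outcome
    beta_direct, p_direct: effect of condition with the mediator controlled
    """
    steps_hold = p_total < alpha and p_a < alpha and p_b < alpha
    reduced = abs(beta_direct) < abs(beta_total)
    if not (steps_hold and reduced):
        return "no mediation established"
    # A direct effect that stays significant indicates partial mediation.
    return "partial mediation" if p_direct < alpha else "full mediation"


# Experiment 1 as reported: total .56, direct .38 (p = .003).
exp1 = classify_mediation(0.56, 0.001, 0.013, 0.001, 0.38, 0.003)
# Experiment 2 as reported: total .40, direct .17 (p = .199).
exp2 = classify_mediation(0.40, 0.003, 0.002, 0.001, 0.17, 0.199)

print(exp1)
print(exp2)
```

With the reported coefficients the sketch returns partial mediation for Experiment 1 and full mediation for Experiment 2, matching the conclusions stated in the text.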


General Discussion 


Empirical Contributions 


This study’s primary empirical contribution is that prompts and 
instruction to imagine the process of respiration and the structure 
of the respiratory system while reading an explanative text on how 
human respiration works facilitate transfer and retention perfor- 
mance on immediate and delayed tests. These results extend pre- 
vious findings concerning rote memory of words or facts by 
demonstrating that mental imagery can also affect deeper under- 
standing of explanative text, as shown in superior transfer perfor- 
mance. 

A secondary finding is that adding external pictures—presented 
either before or after imagining the relevant text paragraph—did 
not add any benefits beyond simply imagining. For example, on 
the transfer test, the imagery group outperformed the control 
group, but the picture-before-imagery and picture-after-imagery 
groups did not. On the structure-retention test, the imagery group 
outperformed the control group and outperformed the picture- 
before-imagery group and the picture-after-imagery group. Over- 
all, the results of Experiment 1 indicate the pictures did not 
enhance the imagery process as was proposed but in some cases 
actually weakened it—and this pattern was similar for both of the 
imagery-and-picture groups, providing evidence for the consis- 
tency of the results. 

How can these results be explained? The text consisted of nine 
paragraphs, and each was accompanied by a picture. The students 
of the picture-before- and picture-after-imagery groups may have 
relied on the external presentation of the picture rather than on 
investing effort in creating their own internal picture of the text 
content. The students in the picture-before-imagery group may
have focused on the external presentation of the picture rather than 
on creating an internal picture of the text content. Similarly, the 
students in the picture-after-imagery group knew that a picture was 


being presented after each paragraph. In anticipating that a picture 
would be shown, they may have decided not to put too much effort 
in the imagination process but rather rely on the external picture. 
In both cases, processing the external pictures would draw re- 
sources away from the mental imagery process. Students might 
have primarily focused on processing features of the pictures that 
were not useful in building a dynamic mental representation. 
Processing the external pictures could also have imposed high 
cognitive load on the learner because the learner had to mentally 
hold the pictorial representation in his or her working memory in 
order to integrate it with the textual input (van Merriënboer &
Ayres, 2005). Mental imagery without external pictures, however, 
seems to foster deeper processing of the text content, presumably 
because the students connected words and their corresponding 
images and relied on these images when constructing a coherent 
runnable model of the respiratory system. A similar result was
reported by Schworm and Renkl (2006), who found that self- 
explanation prompts were less effective when instructor explana- 
tions were available than when they were not available, presum- 
ably because the learners relied upon the instructor explanations 
rather than investing effort in self-explanations. More research is 
necessary to disentangle the effects of external representations, 
such as pictures, and the strategic processes of the learner, such as 
mental imagery. 

Third, the positive effect of mental imagination not only was 
found on an immediate test but was replicated on a delayed test 
that was administered 2 days after the study phase. This result is 
consistent with the findings of Sadoski and Quast (1990), who 
found close relations between mental imagery ratings and text 
recall after a delay of 16 days. 

It is further noteworthy that the results of the second experiment
were highly consistent with those of the first experiment.
Similar to the first experiment, the imagery group showed better 
performance than the control group in transfer, retention, and 
drawing scores, and moreover, the students’ transfer and drawing 
scores were strongly related, with r = .52 in Experiment 1 and 
with r = .59 in Experiment 2. In line with these results, mediation 
analyses indicated that students’ performance in the drawing test 
mediated the effect of imagery instruction on transfer performance 
in Experiments 1 and 2. These results confirm the idea that mental 
imagination fosters deep processing of the text content that con- 
tributes to the durability of these effects. Overall, the fact that the 
imagination effect can be demonstrated on both an immediate test 
and a delayed test and on measures of transfer, retention, and 
drawing points to its robustness. The strong effect sizes (many 
above d = 0.80) point to its practical importance. 

Figure 5. Mediation model in Experiment 2. ** p < .01. *** p < .001. See the online article for the color version
of this figure.




Theoretical Implications 


The results are consistent with the idea that the act of imagining the 
spatial relations among elements described in an explanative text can 
prime generative learning processes—including selecting relevant 
elements, organizing them into a coherent structure, and relating them 
to relevant prior knowledge. The act of imagining, when accom- 
plished successfully, can help students build what Paivio (1986) 
called referential connections between corresponding words and im- 
ages, similar to multimedia learning with presented drawings (Mayer, 
2009). The positive effects of mental imagery on transfer performance 
suggest that classic theoretical concepts of mental imagery based on 
associative tasks (Paivio, 1986), mnemonic techniques (Atkinson, 
1975), or recall of facts (Rasco et al., 1975) can be extended with 
respect to how mental imagery facilitates deep understanding of 
complex explanative scientific text. 

Theories that explain the multimedia principle (Butcher, 
2014; Mayer, 2009) provide one starting point because con- 
structing internal pictures in conjunction with corresponding 
text leads to similar effects on retention and transfer perfor- 
mance as processing external pictures in conjunction with cor- 
responding text. This indicates that the benefits of external 
pictures are transferrable to internal pictures (i.e., pictures 
created through the learner’s imagination). When learning from 
text and external pictures, students select, organize, and inte- 
grate words and corresponding pictures. When imagining text 
content, students also select and organize words and transform 
these words into mental images of the text content. This re-
quires the students to draw referential connections between
words and corresponding images. This process is an intrinsic 
component of the imagination process because the strategy 
cannot be applied without the students creating connections 
between the text and their mental images. This verbal-visual 
connection may be a main benefit of the imagination strategy 
that contributes to enhanced transfer performance on immediate 
and delayed tests. This process is based on the dual coding 
approach described in detail by Sadoski and Paivio (2013). 

Theories of mental model building (Johnson-Laird, 1983) provide 
a second starting point because they specify the nature of the con- 
structed representation. A mental model is a representation that pre- 
serves structural equivalence with the referential content. A mental 
model of the respiratory system therefore would represent the spatial 
relations among the components of the system and their interaction, 
for example, that the diaphragm is located beneath the lungs. When 
students imagine the structure and the functions of the respiratory 
components as described in the text, they are prompted to construct a 
representation that depicts the spatial relations of the system. This 
construction process is based on an interaction of information pro- 
vided by the text and general knowledge such as existent models of 
the learner (Vandierendonck, Dierckx, & van der Beken, 2006). 

Due to the spatial nature of the respiration process, students 
can animate and transform components of the system like a 
runnable mental model that should foster the students’ ability to 
transfer their knowledge to new problems (Hegarty, 2004). One 
main characteristic of the imagery process is that students 
create their mental image step by step (Denis, 2008; Hegarty, 
1992). This stepwise process helps the students to distinguish 
the separate components of the respiratory system and how they 
relate to one another. By contrast, when students perceive an 


external picture, they initially perceive a holistic static image 
and need to put extra effort into processing the text and the 
picture in order to separate its particular components and func- 
tions. Consequently, the images constructed by the imagery 
group might have been more dynamic than the images con- 
structed by the picture-and-imagery groups. Therefore, the im- 
agery group may have found it easier to manipulate these 
images—an activity that is essential in transfer tasks. 


Practical Implications 


The results of the present two experiments point to important 
practical implications for using mental imagination to foster
deeper learning. In conjunction with the research reported in the 
introduction, these findings suggest that mental imagery is a pow- 
erful strategy to enhance transfer and retention performance. In 
particular, as a complement to the multimedia principle (Butcher, 
2014; Mayer, 2009), we propose an imagination principle, which 
says that students learn better from explanative scientific text when 
they are asked to imagine a coherent spatial representation depict- 
ing the relations among key elements in the text. An important 
boundary condition is that our students received very clear imag- 
ination prompts specifying what they should imagine to help them 
construct accurate images of the text content. Previous research 
has shown that students struggle to construct accurate mental 
images of a spatial layout (Denis & Cocude, 1992). Thus, appro-
priate supports are critical to improving the quality of the mental
images that will in turn affect test performance. A further advan- 
tage of mental imagery strategies is that they are easy to implement 
in reading education programs and educational settings. In contrast
to related, better established strategies such as drawing strategies,
mental imagery strategies do not require the students to invest 
additional resources in externalizing their images (Leutner et al., 
2009; Leutner & Schmeck, 2014). 


Limitations and Future Directions 


Some limitations of the study relate to materials, methodology, 
participants, and context. Concerning materials, our imagination 
treatment includes prompts to imagine the text along with the 
names of specific elements to include (e.g., thoracic cavity), so it 
is not possible to determine which aspects of the treatment—the 
imagining part, the elements part, or both—caused the improve- 
ment in learning. A potentially useful finding is that the imagina- 
tion effect was mediated by the students’ spatial representations. 
However, as mediation analysis does not allow us to draw causal 
inferences, further research is required to examine which aspects 
of the imagination treatment are most important in producing an 
improvement in transfer test performance. This methodological 
limitation could be addressed in future work by comparing instruc- 
tions to “study” particular elements of the text versus to “imagine” 
particular elements of the text. 

Furthermore, it should be noted that one prerequisite for the 
imagination strategy to function is that the text employed is written 
clearly enough so that students actually can imagine it because 
students have to rely solely on the text when constructing their 
mental pictures. Thus, it is worth investigating how to support 
students as they imagine and animate dynamic processes that are 
described in the text. A limitation of experiments that utilize 
delayed tests, as we did in Experiment 2, is that it is difficult to
collect reliable data on whether students studied the materials
during the delay before the delayed test. Concerning participants, college
students participated in the study so further research is required to 
determine whether the imagination effect can apply to other age 
groups and learners with different characteristics. Finally, concern- 
ing context, this was a short-term laboratory study, which pro- 
duced promising results, so further work is needed to determine 
how an imagination strategy affects transfer problem solving in 
learning scientific material and how the imagination effect can be 
applied in courses involving multimedia learning. 

Appendix A


Introduction and Paragraphs of the Text on the Respiratory System


Topic: Introduction
Text: Respiration is the process that moves air in and out of the lungs. Through respiration oxygen is delivered to where
it is needed in the body and carbon dioxide is removed from the body. Respiration involves three phases:
inhaling, exchanging and exhaling. The respiratory process is controlled by the nervous system.

Topic: Structure of the Nervous System
Text: The respiratory center is located in the rear, bottom part of the brain, near the back of the neck. The respiratory
center of the brain is connected to a pathway of nerves that leads down from the spinal cord to connect with
muscles controlling the diaphragm and rib cage.

Topic: Steps in the Nervous System to Control Breathing
Text: When the brain detects the need for more oxygen in the bloodstream, the respiratory center in the brain sends out a
signal to inhale. The signal moves along the pathway of nerves to muscles controlling the diaphragm and rib
cage. When the brain detects the need for less carbon dioxide in the bloodstream, the respiratory center in the
brain terminates the signal to inhale. The signal to inhale stops moving along the pathway of nerves to the
muscles controlling the diaphragm and rib cage.

Topic: Structure of the Thoracic Cavity
Text: The thoracic cavity is the space in the chest that contains the lungs. It is surrounded by the rib cage, which can
move slightly inward or outward, and has the diaphragm on the bottom, which has a dome that can move
downward. The main muscles involved in respiration are the diaphragm and the rib muscles. The diaphragm is
located underneath the lungs. It lines the lower part of the thoracic cavity, sealing it off air-tight from the rest of
the body. The rib muscles are attached to the ribs, which in turn encircle the lungs. When in the relaxed
position, the ribs are slightly inward and the diaphragm dome curves upward.

Topic: Structure of the Airway System
Text: From the nose and the mouth the windpipe leads to the bronchial tubes, which branch off into the right and the left
lung. There they branch off into finer tubes.

Topic: Process of Inhaling
Text: During inhaling, a signal from the brain to inhale causes the dome of the diaphragm to contract downward and the
rib cage to move slightly outward creating more space in the thoracic cavity into which the lungs can expand.
Air is drawn in through the nose or mouth, moves down through the windpipe and bronchial tubes to tiny air
sacs in the lungs.

Topic: Structure of the Exchange System
Text: Tiny grape-like air sacs, called alveoli, are grouped together in the lungs at the bronchial tubes. Each air sac is
surrounded by tiny blood vessels called capillaries. On one side of the air sac the surrounding capillaries carry
oxygen and on the other side they carry carbon dioxide. Oxygen-carrying capillaries connect air sacs to larger
blood vessels called arteries and are represented as red because they contain an abundance of oxygen. Carbon-
dioxide-carrying capillaries connect larger blood vessels called veins to the air sacs and are represented as blue
because they contain an abundance of carbon dioxide.

Topic: Structure of the Circulatory System
Text: Arteries (red blood vessels) run one-way from the lungs, through the heart, which is somewhat below the lungs, to
the cells of the body. Arteries transport oxygen, which is used by the cells of the body to make energy. Veins
(blue blood vessels) run one-way in the opposite direction from the cells of the body, through the heart, to the
lungs. Veins transport carbon dioxide, which is a waste gas produced in the cells of the body. The heart is a
pump that keeps the blood flowing in the veins and arteries.

Topic: Process of Exchanging
Text: The exchange of oxygen and carbon dioxide takes place in the connection between air sacs and capillaries. Oxygen
molecules in the inhaled air move to the capillaries running nearby, and carbon dioxide molecules move from
the capillaries into the air sacs in the lungs. The capillaries carry the oxygen to arteries, which transport it,
through the heart, to the cells of the body. At the same time, carbon dioxide travels in veins from the cells of the
body, through the heart, to capillaries running next to the air sacs.

Topic: Process of Exhaling
Text: The carbon-dioxide-rich air in the air sacs is drawn out of the lungs by exhaling. When the brain turns off the
signal to inhale, the diaphragm and the rib muscles relax. The dome of the diaphragm moves upward again and
the ribs move slightly inward. As a result, the thoracic cavity becomes smaller creating less room for the lungs.
Air containing carbon dioxide is forced out of the lungs through the bronchial tubes and windpipe to the nose
and mouth, where it leaves the body.

Appendix B


Imagination Instructions for the Nine Text Paragraphs


Paragraph 1: Please imagine the structure of the nervous system consisting of the brain, nerves, diaphragm, and rib muscles.

Paragraph 2: Please imagine the steps in the nervous system when the brain sends a signal to the diaphragm and rib muscles.

Paragraph 3: Please imagine the structure of the thoracic portion, consisting of the thoracic cavity, lungs, rib cage, and diaphragm.

Paragraph 4: Please imagine the structure of the airway portion, consisting of the nose, mouth, windpipe, bronchial tubes, lungs, and air sacs.

Paragraph 5: Please imagine the steps in the thoracic cavity and the airway when the diaphragm and rib muscles receive a signal to inhale.

Paragraph 6: Please imagine the structure of the exchange system consisting of air sacs, oxygen-carrying capillaries, carbon-dioxide-carrying
capillaries, veins, and arteries.

Paragraph 7: Please imagine the structure of the circulatory system consisting of lungs, arteries, veins, heart, and cells of the body.

Paragraph 8: Please imagine the steps in the exchange system and the circulatory system for the process of exchanging.

Paragraph 9: Please imagine the steps in the thoracic cavity, airway, diaphragm and rib muscles for the process of exhaling.
While it is hypothesized that providing instruction based on individuals’ preferred learning styles
improves learning (i.e., reading for visual learners and listening for auditory learners, also referred to as 
the meshing hypothesis), after a critical review of the literature Pashler, McDaniel, Rohrer, and Bjork 
(2008) concluded that this hypothesis lacks empirical evidence and subsequently described the experi- 
mental design needed to evaluate the meshing hypothesis. Following the design of Pashler et al., we 
empirically investigated the effect of learning style preference with college-educated adults, specifically 
as applied to (a) verbal comprehension aptitude (listening or reading) and (b) learning based on mode of 
instruction (digital audiobook or e-text). First, participants’ auditory and visual learning style preferences 
were established based on a standardized adult learning style inventory. Participants were then given a 
verbal comprehension aptitude test in both oral and written forms. Results failed to show a statistically 
significant relationship between learning style preference (auditory, visual word) and learning aptitude 
(listening comprehension, reading comprehension). Second, participants were randomly assigned to 1 of
2 groups that received the same instructional material from a nonfiction book, but each in a different 
instructional mode (digital audiobook, e-text), and then completed a written comprehension test imme- 
diately and after 2 weeks. Results demonstrated no statistically significant relationship between learning 
style preference (auditory, visual word) and instructional method (audiobook, e-text) for either immediate 
or delayed comprehension tests. Taken together, the results of our investigation failed to statistically 
support the meshing hypothesis either for verbal comprehension aptitude or learning based on mode of 
instruction (digital audiobook, e-text).


Keywords: learning styles, listening and reading comprehension, audiobooks, e-text 


Teaching to individuals’ perceived learning styles in hopes that 
they will achieve greater academic success is common practice 
within the field of education. Not only does the learning styles 
concept have widespread acceptance among educators (Dekker, 
Lee, Howard-Jones, & Jolles, 2012) but also it is accepted among 
the general public (Pashler, McDaniel, Rohrer, & Bjork, 2008). 
The learning style literature, as well as learning style inventories,
differs widely in the way that learning styles are conceived and 
assessed (see Coffield, Moseley, Hall, & Ecclestone, 2004, and 
Pashler et al., 2008, for review). For example, in the Gregorc Style 
Delineator (Gregorc, 1982), learning styles are defined by percep- 
tion (concrete or abstract) and ordering (sequential or random). 
Kolb’s Learning Style Inventory (1985) emphasizes experiential
learning and includes accommodating, diverging, converging,
and assimilating styles. Herrmann’s Brain Dominance
Instrument (1996) categorizes learners as theorists (cerebral, 
left: the rational self), organizers (limbic, left: the safe-keeping 
self), innovators (cerebral, right: the experimental self), and 
humanitarians (limbic, right: the feeling self). Dunn and Dunn’s 
Learning Styles Inventory (Dunn, Dunn, & Price, 1989) con- 
centrates on modality-specific strengths and weaknesses (e.g., 
visual, auditory, tactile, and kinesthetic processing). In the 
current study, we focused on verbal comprehension, specifi- 
cally, the extent to which verbal comprehension may be influ- 
enced by the modality of input: auditory (digital audio) or 
visual (e-text). 

While the learning styles literature has been extensively dis- 
cussed and reviewed, there are considerably more theoretical and 
descriptive discussions on this topic than there are empirical studies.
For example, Cassidy (2004) described the central themes and
issues surrounding learning styles and the many instruments avail- 
able for the measurement of learning styles with the goal of 
promoting research in the field. Kozhevnikov (2007) presented a
literature review on cognitive styles, which served as a basis for 
the author’s theory that suggests that cognitive styles represent 
heuristics that can be identified at multiple levels of information 
processing, from perceptual to metacognitive, and that individ- 
uals can be grouped according to the type of regulatory function 
they exert. Sternberg, Grigorenko, and Zhang (2008) divided 
learning and thinking into two basic styles: ability based and 
personality based, and advocated that both are important for 
instruction and assessment. They argued that teachers need to 
take into consideration differences in how students learn and 
think and design instruction accordingly to obtain optimal in- 
structional outcomes. 

The importance of evaluating students’ learning styles and de- 
veloping instructional methods that teach to specific learning 
styles has gained considerable support in the field of education, 
with many organizations and companies offering professional de- 
velopment courses for teachers and educators focused on the topic 
of learning styles. For this reason, Pashler, McDaniel, Rohrer, and 
Bjork (2008) were charged with reviewing the empirical evidence 
pertaining to the importance of assessing and teaching to students’ 
learning styles for the journal Psychological Science in the Public 
Interest. In their review, they define learning styles as “the concept 
that individuals differ in regard to what mode of instruction or 
study is most effective for them. . .. The most common—but not 
the only—hypothesis about the instructional relevance of learning 
styles is the meshing hypothesis, according to which instruction is 
best provided in a format that matches the preferences of the 
learner (e.g., for a ‘visual learner,’ emphasizing visual presentation 
of information)” (p. 105). After reviewing the literature, they found
that while there is evidence that, if asked, both children and adults 
indicate preferences as to how they favor information be presented 
to them, and there is also evidence that people have specific 
aptitudes for processing different types of instruction, there is 
limited empirical evidence as to whether providing instruction in 
an individual’s preferred learning style (i.e., listening for those 
with an auditory learning style or reading for those with a visual 
learning style) improves learning. Furthermore, they also con- 
cluded that the definitive study showing that individuals with a 
preferred auditory learning style learn better when listening rather 
than reading, and conversely, that those with a preferred visual 
learning style learn better when reading rather than listening, had 
not been conducted. 

Given the lack of credible validation of learning-styles-based 
instruction, Pashler et al. (2008) described a three-step experimen- 
tal design of the study that would need to be conducted, as well as 
the pattern of data that would need to be found, in order to 
conclude empirically that learning is significantly improved when 
individuals receive instruction tailored to their asserted learning 
style. In Step 1, participants must be divided into groups on the 
basis of their learning style. In Step 2, participants from each group 
must be randomly assigned to receive one of multiple instructional 
methods. In Step 3, participants must complete an assessment of 
the material that is the same for all students. For the learning styles 
meshing hypothesis to be supported, data analysis must reveal a 
specific type of interaction between learning style and instructional
method. That is, learning is optimal when individuals receive
instruction in their preferred learning style, and the instructional 
method that proves most effective for individuals with one learn- 
ing style is not the most effective method for individuals with a 
different learning style. 
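
The required pattern can be illustrated with a short sketch. The function name `supports_meshing`, the dictionary layout, and all cell means below are hypothetical and are not data from any study. The check is purely descriptive: it asks whether each learning style group scores highest under its matched method, and it does not test whether the interaction is statistically significant.

```python
def supports_meshing(cell_means, matched_method):
    """Check the descriptive crossover pattern for the meshing hypothesis.

    cell_means: dict mapping (style, method) -> mean score on the common test.
    matched_method: dict mapping each style -> the method that matches it.
    Returns True only if every style group scores strictly highest
    under its matched method.
    """
    methods = {method for _, method in cell_means}
    for style, matched in matched_method.items():
        others = [cell_means[(style, m)] for m in methods if m != matched]
        if any(cell_means[(style, matched)] <= score for score in others):
            return False
    return True


# Hypothetical mapping of styles to their matched instructional methods.
MATCH = {"auditory": "audiobook", "visual word": "e-text"}

# Hypothetical means showing a crossover (would support the hypothesis).
crossover = {
    ("auditory", "audiobook"): 80.0, ("auditory", "e-text"): 70.0,
    ("visual word", "audiobook"): 68.0, ("visual word", "e-text"): 82.0,
}

# Hypothetical means where both groups do best with the same method.
same_best = {
    ("auditory", "audiobook"): 80.0, ("auditory", "e-text"): 70.0,
    ("visual word", "audiobook"): 82.0, ("visual word", "e-text"): 72.0,
}

print(supports_meshing(crossover, MATCH))  # True
print(supports_meshing(same_best, MATCH))  # False
```

In the second case one group outperforms the other overall, but because the same method is best for both groups, the pattern gives no reason to tailor instruction to learning style.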

Pashler et al. (2008) also pointed out that educators as well as 
the general public fail to distinguish between learning style preferences
and learning aptitude. They stated that “[t]here is, after all,
a commonsense reason why the two concepts could be con- 
flated: Namely, different modes of instruction might be optimal 
for different people because different modes of presentation 
exploit the specific perceptual and cognitive strengths of dif- 
ferent individuals, as suggested by the meshing hypothesis” 
(pp. 109-110). However, the relationship between learning 
style preference and learning aptitude, specifically as it relates 
to the meshing hypothesis and verbal comprehension, has not 
been established empirically. 

In 2012, Dekker, Lee, Howard-Jones, and Jolles reported that 
94% of educators believed that students perform better when they 
receive information in their preferred learning style (e.g., auditory, 
visual, kinesthetic). Given this continued widespread belief and the 
influence of learning styles on educational practice, coupled with 
the importance of verbal comprehension on educational outcomes, 
we conducted an investigation of the meshing hypothesis as it 
pertains to verbal aptitude and learning. We implemented the 
methodology and analyses proposed by Pashler et al. (2008) in 
order to directly test the following two research questions: 

1. What is the extent to which learning style preferences (audi- 
tory, visual) equate to learning aptitudes (listening comprehension, 
reading comprehension)? 

2. What is the extent to which learning style preferences and/or 
learning aptitudes predict how much an individual comprehends 
and retains based on mode of instruction (audiobook, e-text)? 

In the first research question, we investigated the relationship 
between learning style preferences (as measured by a standardized 
learning style inventory) and learning aptitudes (as measured by a 
listening and reading comprehension assessment). Specifically, as 
applied to the relationship between verbal aptitude and learning 
style preference, the meshing hypothesis predicts that (a) there will 
be a positive correlation between auditory learning style prefer- 
ence and listening comprehension, (b) there will be a positive 
correlation between visual word learning style preference and 
reading comprehension, and (c) individuals with a visual learning 
style preference will comprehend better when they read rather than 
listen, and conversely, individuals with an auditory learning style 
preference will comprehend better when they listen rather than 
read. 

In the second research question, we investigated the extent to 
which learning style preferences (auditory, visual) and/or learning 
aptitudes (listening comprehension, reading comprehension) pre- 
dict how much an individual will learn and retain based on two 
different modes of instruction (audiobook, e-text). Specifically, the 
meshing hypothesis predicts that individuals with a visual learning 
style preference learn more when they read e-text rather than when 
they listen to an audiobook, and conversely, individuals with an 
auditory learning style preference learn more when they listen to 
an audiobook rather than read e-text. Analogous predictions would 
be expected with regards to the relationship between listening 
comprehension aptitude and learning from an audiobook and reading
comprehension aptitude and learning from e-text.


Method 


Participants 


In order to be included in this study, participants had to meet the 
following inclusionary criteria: age 25-40 years; college educated
(bachelor’s degree only); native speakers of English; normal hear- 
ing and vision (with correction); and no self-reported history of 
neurological or learning impairments. Potential participants out- 
side this age range, who had more advanced degrees beyond a 
bachelor’s degree, who had not graduated from college, or who 
had a history of neurological or learning disabilities were ex- 
cluded. Based on these criteria, 121 participants from the New 
York City metropolitan area were selected. Of the total population 
of 121 subjects, 62 were male and 59 were female. The mean age 
of the participants was 30.6 years (SD = 4.4). All participants 
completed 16 years of education. This study examined the two 
research questions. For Research Question 1, the entire population 
of 121 individuals participated. These 121 individuals were then 
randomly assigned to four groups. Two of these groups (61 par- 
ticipants) participated in Research Question 2. The remaining 
participants who had been randomly assigned to the other two 
groups participated in a different study that was not focused on 
learning styles. The 61 participants in Research Question 2 were 
randomly assigned to a listening condition (n = 30) or a reading 
condition (n = 31). The analyses of Research Question 2 focused 
only on those participants who could be categorically classified as 
having an auditory or visual word learning style and who were 
randomly assigned to either a listening or reading condition. The 
final four subgroups included in Research Question 2 analyses 
were listening condition with auditory learning style (n = 11),
listening condition with visual word learning style (n = 10), 
reading condition with auditory learning style (n = 10), and 
reading condition with visual word learning style (n = 10). 

This study was conducted in accordance with the prescribed 
standards of the institutional review board of Rutgers University–Newark.
All participants provided informed consent and were
financially compensated for their participation. 


Learning Styles Assessment 


Prior to on-site testing, participants completed an online stan- 
dardized learning styles preference inventory. Pashler et al. (2008) 
identified the Dunn and Dunn learning styles model as being one 
of the most popular learning styles assessment tools because of the 
constructs included as well as the broad age range of the assess- 
ments offered—from children as young as 3 years old through 
adults. For this study, we selected the adult version, the Building 
Excellence (BE) Online Learning Styles Assessment Inventory for 
ages 17 and older (Rundle & Dunn, 2010). The BE Learning Styles 
Inventory is a self-administered online survey that requires 20-25 
min for completion. The assessment measure asks participants to 
decide if they strongly disagree, disagree, are uncertain, agree, or 
strongly agree after reading statements indicating, for example, 
whether the respondent remembers new information better by 
reading about it or by listening to a discussion about it (Rundle &
Dunn, 2010). The BE Learning Styles Inventory assesses individual
learning and productivity styles based upon six domains:
perceptual, psychological, environmental, physiological, emo- 
tional, and sociological. The perceptual domain is subdivided into 
the following six elements: auditory (input), visual picture, visual 
word, tactual, kinesthetic, and auditory verbal (output). The BE 
Learning Styles Inventory provides an individual’s strengths and 
weaknesses pertaining to these six possible perceptual learning 
styles. For each learning style preference, individuals are placed 
into one of five bins that are continuous, ranging from very weak 
to very strong. For example, the five bins for auditory are classi- 
fied as (1) strong less auditory, (2) moderate less auditory, (3) it 
depends, (4) moderate more auditory, and (5) strong more audi- 
tory. Within each bin, there is a 3-point range with the exception 
of Bin 3 (it depends) that has a 5-point range, for a total of 17 
possible placements along the continuum for each perceptual ele- 
ment. For the purpose of this study, we focused only on those 
elements (auditory and visual word) that most relate to listening 
and reading comprehension, respectively. 
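
The mapping from the 17 placements to the five bins can be sketched as follows. The exact cut points are an inference from the description above (3-point ranges for Bins 1, 2, 4, and 5, and a 5-point range for Bin 3); the function name `placement_to_bin` is hypothetical and is not part of the BE Learning Styles Inventory.

```python
# Bin labels as described in the text for any perceptual element.
BIN_LABELS = {
    1: "strong less",
    2: "moderate less",
    3: "it depends",
    4: "moderate more",
    5: "strong more",
}


def placement_to_bin(placement):
    """Map a placement (1-17) on the continuum to one of the five bins.

    Cut points are inferred from the stated bin widths:
    1-3, 4-6, 7-11, 12-14, 15-17.
    """
    if not 1 <= placement <= 17:
        raise ValueError("placement must be between 1 and 17")
    if placement <= 3:
        return 1
    if placement <= 6:
        return 2
    if placement <= 11:
        return 3
    if placement <= 14:
        return 4
    return 5


# Count how many placements fall in each bin.
widths = [sum(1 for p in range(1, 18) if placement_to_bin(p) == b)
          for b in range(1, 6)]
print(widths)  # [3, 3, 5, 3, 3]
```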

The BE Learning Styles Inventory provides personalized reports 
that convert an individual’s numerical score into instructional 
recommendations. For example, if a participant scores strong less 
auditory or moderate less auditory (Bin 1 or Bin 2, respectively/
corresponding Placements 1-6), the recommendation prescribed
by the BE Learning Styles Inventory would be that because listening
is not a strength, the participant should rely on a stronger
style when learning new material. If a participant scores strong less
visual word or moderate less visual word (Bin 1 or Bin 2, respectively/
corresponding Placements 1-6), the recommendation prescribed
would be that because reading is not a strength, the
participant should rely on a stronger style when learning new and 
difficult information. If a participant scores it depends in either 
auditory or visual word (Bin 3/corresponding Placements 7-11), 
the BE Learning Styles recommendation acknowledges that the 
participant is indifferent to the modality. He or she is encouraged 
to use one of his or her strengths when learning new information. 
If a participant scores moderate more auditory/visual or strong 
more auditory/visual (Bin 4 or Bin 5, respectively/corresponding 
Placements 12-17), the individual is advised to use that style most 
of the time when learning. While the automated computer scoring 
system generated scores and reports for each participant, the par- 
ticipants were not informed about the purpose of the study or given 
access to their scores or reports or given any feedback from this 
survey. 

In this study, learning styles data were analyzed using two
different scoring procedures. For correlation and regression analyses,
data were analyzed using the full standard continuous 17-point
scoring method provided by the BE Learning Styles Inventory.
Three variables were used: BE auditory (range = from +1
to +17), BE visual word (range = from +1 to +17), and the
difference between BE auditory and BE visual word (range = from
−17 to +17). In this study, participants’ BE auditory scores ranged
from +2 to +17, their visual word scores ranged from +5 to +17,
and the difference scores (BE auditory minus BE visual word)
ranged from −11 to +8 (M = −0.92, SD = 3.98).

In addition, in order to follow the analysis prescribed by Pashler 
et al. (2008), which addressed the meshing hypothesis directly, 
individuals must first be divided into groups on the basis of their 
preferred learning style. For this purpose, participants were classified
categorically as having primarily either an auditory or visual
word learning style. We used the five bin categories provided by 
the BE Learning Styles Inventory: strong less auditory/visual 
word = 1; moderate less auditory/visual word = 2; it depends =
3; moderate more auditory/visual word = 4; and strong more 
auditory/visual word = 5. According to the BE Learning Styles 
Inventory, only participants who scored moderate to strong more 
auditory (either a 4 or 5) as well as it depends or moderate to 
strong less visual word (3, 2, 1) were instructed to use the auditory 
modality “much of the time.” For the purposes of this analysis, 
these participants were classified as having an auditory learning 
style (n = 37). Similarly, only participants who scored moderate to 
strong more visual word (either a 4 or 5) as well as it depends or 
moderate to strong less auditory (3, 2, 1) were instructed to use the 
visual word modality “much of the time.” These participants were 
classified as having a visual word learning style (n = 31). Of the 
121 individuals who participated in Research Question 1, 53 
participants could not be categorically classified as either auditory 
or visual word learners and, as such, were not included in the 
analyses that required categorical classification. 
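
The classification rule just described can be written out as a small sketch. The function name `classify_style` and the use of `None` for unclassifiable participants are illustrative choices, not part of the inventory or of the original analysis scripts.

```python
def classify_style(auditory_bin, visual_word_bin):
    """Classify a participant from the two BE bin scores (each 1-5).

    auditory learner:    auditory bin 4 or 5 AND visual word bin 1, 2, or 3
    visual word learner: visual word bin 4 or 5 AND auditory bin 1, 2, or 3
    Anyone else cannot be categorically classified (returns None).
    """
    for value in (auditory_bin, visual_word_bin):
        if value not in (1, 2, 3, 4, 5):
            raise ValueError("bin scores must be integers from 1 to 5")
    if auditory_bin >= 4 and visual_word_bin <= 3:
        return "auditory"
    if visual_word_bin >= 4 and auditory_bin <= 3:
        return "visual word"
    return None


print(classify_style(5, 2))  # auditory
print(classify_style(3, 4))  # visual word
print(classify_style(4, 4))  # None (strong on both elements)
print(classify_style(3, 3))  # None (it depends on both elements)
```

Under this rule, participants who score high on both elements or on neither are left unclassified, which is consistent with the 53 participants excluded from the categorical analyses.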


Verbal Comprehension Aptitude Measure 


The goal of this study was to determine the extent to which 
learning style preference (auditory, visual) and/or verbal aptitude (listen- 
ing comprehension, reading comprehension) relates to the effective- 
ness of instructional method (audiobook, e-text). Because there is 
no standardized assessment designed to directly compare listening 
and reading comprehension aptitude in adults, we developed a 
verbal comprehension aptitude test in both a listening and reading 
format using matched passages from two equivalent forms of the 
fourth edition of the Gray Oral Reading Test (GORT-4).
GORT-4 is a standardized assessment measure composed of lev- 
eled passages that objectively measure oral reading rate, accuracy, 
fluency, and comprehension, as well as alerts to possible learning 
exceptionalities (Wiederholt & Bryant, 2000). GORT-4 (age
range: from 6.0 to 18.11 years) consists of 13 passages that 
become increasingly difficult as the examinee progresses. After 
pilot testing all passages in college-educated adults, we selected 
passages 9, 10, 11, and 13 for use in this study. Passages 1-8 and 
12 did not provide sufficient individual differences in our college- 
educated population and were not included. None of the individ- 
uals who participated in pilot testing were included in the current 
study. The selected passages ranged from 148 to 167 words (M = 
158, Mdn = 157). Each passage was followed by five comprehen- 
sion questions. To assess listening comprehension, we converted 
the selected passages from Form B of the GORT-4 into a digital 
audio recording. A professional audiobook narrator, who read at a 
steady pace and with natural intonation, recorded the passages. We 
will refer to this assessment as the Listening Aptitude Test (L-AT). 
To assess reading comprehension, we asked each participant to 
read the selected passages from Form A of the GORT-4 silently. 
This assessment will be referred to as the Reading Aptitude Test 
(R-AT). 

Each of the 121 participants in this study was tested on both the 
L-AT and the R-AT. The order in which the L-AT and the R-AT 
were taken was counterbalanced to reduce the chance that the 
order of testing would adversely influence the results. Half of the 
participants completed the R-AT and then L-AT, where they read
the first four passages and then listened to the remaining four
passages; the other half of the participants completed the L-AT 
and then R-AT, where they listened to the first four passages and 
then read the remaining four passages. Participants read each 
passage silently from a computer screen or listened through head- 
phones to a digital audio recording. 
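
A counterbalanced order assignment of this kind might be sketched as below. The shuffling procedure, the seed, and the function name `counterbalance` are assumptions made for illustration; the text does not describe how participants were allocated to the two orders.

```python
import random


def counterbalance(participant_ids, seed=0):
    """Assign each participant one of two test orders, split as evenly as possible.

    The seed is a hypothetical value used only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    orders = {}
    for index, pid in enumerate(ids):
        # First half reads then listens; second half listens then reads.
        orders[pid] = ("R-AT", "L-AT") if index < half else ("L-AT", "R-AT")
    return orders


orders = counterbalance(range(1, 122))  # 121 participants
read_first = sum(1 for order in orders.values() if order[0] == "R-AT")
listen_first = len(orders) - read_first
print(read_first, listen_first)  # 60 61
```

With an odd number of participants, the two orders cannot be exactly equal in size, so one order receives one more participant than the other.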

Immediately after they read or listened to each passage, partic- 
ipants answered the five corresponding multiple-choice questions 
for that passage. Note that part of the answer from one of the 
questions on the R-AT was accidentally omitted. Therefore, data 
could only be collected from 19 of the 20 questions. To assure that 
the R-AT and the L-AT remained equivalent, the comparable
question from the L-AT was also deleted from all analyses. The 
protocol designed by Pashler et al. (2008) to assess the meshing 
hypothesis requires individuals to complete an assessment that is 
the same for all participants. All participants answered the com- 
prehension questions in the same (written) format. We chose to 
focus on this response format because most tests of comprehension 
are administered in writing. The program required a response for 
each question before the participant could proceed to the next 
question. Participants were not permitted to re-read or re-listen to 
any passage nor were they allowed to use the passage as a refer- 
ence when answering the questions. No feedback was given. 


Instructional Unit 


Two modes of instruction were investigated for the same unit 
(audiobook, e-text). The content used across both of these instruc- 
tional conditions was the preface and Chapter 17 of the nonfiction 
book, Unbroken: A World War II Story of Survival, Resilience, and 
Redemption, written by Laura Hillenbrand and read by Edward
Herrmann. The total content contained 3,184 words. Forty-eight
multiple-choice questions were designed to assess the participants’ 
comprehension. These 48 questions will be referred to as the 
Unbroken comprehension test. 

The question set was developed by a certified teacher of English 
(B.R.), who serves on the Pennsylvania State Standardized Assess- 
ment Panel where she reviews reading assessment items for con- 
tent, rigor, alignment, bias, and universal and technical design. 
Questions were piloted for difficulty on a sample of 10 individuals 
meeting the eligibility requirements for participants in this study 
but who were not participants in the study. The Unbroken comprehension
test was given twice, once immediately following completion
of the passage (Time 1) and again 2 weeks later (Time 2). 


Procedure 


After completing the Listening Aptitude Test (L-AT) and Read- 
ing Aptitude Test (R-AT), participants were randomly assigned to 
one of two instructional conditions for the Unbroken portion of the 
study. Participants in each of the instructional conditions received 
the preface and Chapter 17 of Unbroken, presented in one of two 
different formats. In the audiobook condition, participants used 
headphones to listen to both the preface and Chapter 17 of Un- 
broken presented on an electronic tablet in digital audiobook 
format. In the e-text condition, participants read both the preface 
and Chapter 17 of Unbroken presented on an electronic tablet in 
e-text format. A research assistant pre-cued the e-text or audio, as 
well as monitored the participant to assure that there were no 
interruptions and that the participant understood how to use the
equipment, was on-task, and did not extend reading/listening be- 
yond the prescribed passages. Prior to administration of the pas- 
sage for the audio condition, the volume was adjusted to a com- 
fortable level. The audio condition lasted 16 min 24 s and was read 
at a pace of 149 words per minute. Participants in the e-text 
condition read at their own pace without time constraint. The
replaying/fast-forwarding of audio and the re-reading/skipping of 
text were prohibited. The research assistant monitored partici- 
pants’ compliance. 

Upon completion of Chapter 17, participants proceeded imme- 
diately (Time 1) to take the Unbroken comprehension test and 
answer 48 questions derived from the preface and Chapter 17. 
Participants were not allowed to use the e-text or digital audiobook 
as a reference. Each question was individually displayed in written 
text only on a computer screen, as is common in standard testing 
practices. The online multiple-choice assessment required a re- 
sponse for each question before the examinee could proceed to the 
next question. No feedback was given. In addition to the online, 
on-site immediate comprehension assessment (Time 1), partici- 
pants completed the same multiple-choice assessment online 2 
weeks later (Time 2) in order to evaluate their retention of the 
information in the story. 


Results 


Analyses for Research Question 1 


Research Question 1 addresses the extent to which learning style 
preferences (auditory, visual word) as measured by the BE Learn- 
ing Style Inventory equate to learning aptitudes (listening compre- 
hension, reading comprehension) as measured by the L-AT and 
the R-AT.

To evaluate the equivalence of the L-AT and the R-AT for 
assessing comprehension aptitude in this population (N = 121), we
calculated a paired-samples t test comparing the mean of the L-AT 
(M = 13.9, SD = 3.4) to the mean of the R-AT (M = 12.8, SD = 
2.8). A significant difference was found, t(120) = 3.54; p < .01. 
The mean of the L-AT was significantly higher than the mean on 
the R-AT with an effect size of Cohen’s d = 0.32. Although this 
difference was not ideal, it is important to note that the main 
hypothesis pertaining to the interaction between learning style and 
mode of instruction does not require that the R-AT and L-AT 
measures be equivalent. 
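
The reported effect size is consistent with one common convention for paired designs, in which Cohen's d is computed as the t statistic divided by the square root of the number of pairs. Whether the authors used exactly this formula is an assumption; the sketch below only shows that the reported values agree with it.

```python
import math


def paired_d_from_t(t_value, n_pairs):
    """Cohen's d for paired samples via d = t / sqrt(n) (one common convention)."""
    return t_value / math.sqrt(n_pairs)


# Values reported in the text: t(120) = 3.54 with 121 participants.
d = paired_d_from_t(3.54, 121)
print(round(d, 2))  # 0.32
```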

Analyses using categorical learning style variables to predict 
learning aptitude: Implementing the Pashler et al. (2008) 
method. Pashler et al. (2008) prescribed a specific methodology 
for assessing the meshing hypothesis that requires that participants 
be categorically classified into two discrete learning styles (audi- 
tory learners or visual word learners). To follow this methodology 
explicitly, participants were classified into two discrete learning 
style categories: auditory learners (n = 37) or visual word learners 
(n = 31) as described in the Methods section. 

A one-way multivariate analysis of variance (MANOVA) was 
calculated examining the effects of learning style preference 
groups (auditory, visual word) on the L-AT and R-AT scores to 
determine if learning style preference (auditory, visual word) pre- 
dicts listening or reading comprehension aptitude. A significant 
effect of aptitude test (L-AT vs. R-AT) was found, F(1, 66) =
12.7; p < .05, with an effect size η² = 0.16, indicating that
participants performed significantly better on one aptitude test (L-AT:
M = 14.1, SD = 3.5) than on the other aptitude test (R-AT: M =
12.8, SD = 2.9). There was also a significant effect of learning
styles preference (auditory vs. visual word), F(1, 66) = 6.9; p <
.05, with an effect size η² = 0.09, indicating that
one learning styles preference group (visual word: M = 14.4, 
SD = 4.0) performed significantly better than the participants in 
the other learning styles preference group (auditory, M = 12.6, 
SD = 3.6). There was not a significant aptitude test (L-AT, R-AT) 
by learning styles preference (auditory, visual word) interaction, 
F(1, 66) = 0.34; p > .05. Further inspection of these results using 
one-way analyses of variance (ANOVAs) showed that overall, participants
in the visual word learning style group scored significantly
higher on the L-AT, F(1, 66) = 5.48, p < .05 (M = 15.16,
SD = 3.10), than participants in the auditory learning style group 
(M = 13.22; SD = 3.65). Participants in the visual word learning 
style group also scored significantly higher on the R-AT, F(1, 
66) = 4.91; p < .05 (M = 13.58, SD = 2.50), than participants in 
the auditory learning style group (M = 12.08; SD = 2.99). These 
results indicate that participants who had a visual word learning 
style preference were significantly better at both listening and 
reading comprehension, compared to those who had an auditory 
learning style preference. 
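
The reported effect sizes can be checked against the F ratios. The sketch below assumes that the effect sizes are partial eta squared, which the text does not state explicitly; the function name is illustrative only. Under that assumption, the two significant main effects reproduce the reported values of 0.16 and 0.09.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported in the text above.
# Treating the effect sizes as *partial* eta squared is an assumption.
aptitude_test = partial_eta_squared(12.7, 1, 66)
learning_style = partial_eta_squared(6.9, 1, 66)
interaction = partial_eta_squared(0.34, 1, 66)

print(f"aptitude test:    {aptitude_test:.2f}")
print(f"learning style:   {learning_style:.2f}")
print(f"interaction:      {interaction:.3f}")
```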

According to Pashler et al. (2008), acceptable evidence in sup- 
port of the meshing hypothesis would show a crossover between 
two learning style preference groups (auditory, visual word) and 
listening and reading comprehension aptitude (L-AT, R-AT), as 
shown in Figure 1A. Figure 1B shows an example taken from 
Pashler et al. (2008) of one form of unacceptable evidence for the 
meshing hypothesis, where both auditory and visual word learning 
style preference groups score higher on the same method, and 
hence there is no crossover. Figure 1C shows the data from the 
current study. As shown in Figure 1C, contrary to the crossover 
pattern that would be expected to support the meshing hypothesis, 
the auditory and visual word learning style preference groups both 
scored higher on listening comprehension than on reading com- 
prehension. It is important to note that not only was the L-AT 
performance better for both groups but also the superiority of the 
L-AT over the R-AT was similar for both groups. According to 
Pashler et al. (2008), this pattern corresponds to one example of 
unacceptable evidence in support of the meshing hypothesis. 
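
The crossover logic can be illustrated with the group means reported above. This is a descriptive sketch only, not a significance test: the helper names are hypothetical, and reducing the Pashler et al. (2008) criterion to a sign check on the two listening-minus-reading differences is a simplifying assumption.

```python
# Group means on each aptitude test, as reported in the one-way ANOVAs above.
means = {
    "auditory": {"L-AT": 13.22, "R-AT": 12.08},
    "visual word": {"L-AT": 15.16, "R-AT": 13.58},
}


def listening_advantage(group: str) -> float:
    """Listening mean minus reading mean for one learning style group."""
    return means[group]["L-AT"] - means[group]["R-AT"]


def shows_crossover(auditory_adv: float, visual_adv: float) -> bool:
    """True only if auditory learners score higher when listening
    and visual word learners score higher when reading."""
    return auditory_adv > 0 and visual_adv < 0


aud = listening_advantage("auditory")
vis = listening_advantage("visual word")
print(f"auditory: {aud:+.2f}, visual word: {vis:+.2f}")
print("crossover:", shows_crossover(aud, vis))
```

Both differences are positive (both groups score higher on the L-AT), so the check returns False, in line with the pattern described for Figure 1C.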

However, classification of participants into two discrete groups 
reduced the sensitivity of continuous variables and also reduced the 
sample size by including only those participants who had a clear auditory 
or visual word learning style preference. To mitigate these concerns, we 
performed a final series of correlation and stepwise multiple 
regression analyses (n = 121) to evaluate whether there was a 
significant relationship between learning style preference (BE 
Learning Styles Inventory) and learning aptitude (L-AT, R-AT). 
For these analyses, variables from the BE Learning Styles Inven- 
tory based on the continuous 17-point standard BE scoring system 
were used to predict verbal comprehension aptitude scores on the 
L-AT and R-AT. 

Correlation and regression analyses for Research Question 1. 
The relationship between learning style preference (BE Learning 
Styles Inventory) and listening and reading comprehension aptitude 
(L-AT and R-AT) was evaluated by a series of simple correlation 
analyses as well as stepwise multiple regression analyses. For these 
analyses, the following variables from the BE Learning Styles 
Inventory were used based on the 17-point standard BE scoring 
system: BE auditory score, BE visual word score, and the difference 
between the BE auditory score and the BE visual word score (BE 
auditory score − BE visual word score), to predict verbal 
comprehension aptitude. Verbal comprehension aptitude outcomes of 
interest included (a) predicting listening aptitude (L-AT raw 
score), (b) predicting reading aptitude (R-AT raw score), and (c) 
predicting the difference between listening and reading aptitude 
(L-AT − R-AT raw score). Table 1 presents the means and standard 
deviations for each of the BE learning styles and verbal 
comprehension variables as well as the correlation matrix for these 
variables, and Table 2 presents the results of the multiple 
regression analyses. 

Figure 1. Graph A displays the pattern of evidence required to support 
the meshing hypothesis while Graph B displays one of several patterns of 
evidence that would constitute unacceptable evidence (according to Pashler 
et al., 2008). Graph C displays the results from the current study, which 
show that there is no crossover effect. Bars represent standard errors. The 
95% confidence interval (CI) for the Listening Aptitude Test (L-AT) 
ranged from 12.10 to 14.33 for participants with an auditory learning 
preference and from 13.94 to 16.39 for participants with a visual learning 
preference. The 95% CI for the Reading Aptitude Test (R-AT) ranged 
from 11.17 to 12.99 for participants with an auditory learning preference 
and from 12.58 to 14.58 for participants with a visual learning preference. 

Predicting listening comprehension aptitude from learning 
style preference scores. The meshing hypothesis predicts a pos- 
itive correlation between learning style preference and aptitude. 
That is, if auditory learning style equates to listening comprehen- 
sion aptitude, as auditory learning style preference scores increase, 
listening comprehension aptitude score would also increase. As 
seen in Table 1, contrary to expectation based on the meshing 
hypothesis, the correlation between auditory learning style preference 
(based on the BE auditory score) and listening comprehension 
(based on the L-AT score) was negative (r = −.31, p < .01). To 
further test whether other learning style variables influence listen- 
ing comprehension aptitude, we calculated multiple linear regres- 
sion analyses to determine the extent to which participants’ listen- 
ing comprehension aptitude (L-AT) could be predicted based on 
their BE auditory learning style score, BE visual word learning 
style score, and the difference between their BE auditory and BE 
visual word scores. As seen in Table 2, a significant regression 
equation was found, F(1, 119) = 12.96, p < .001, with an R² of 
.10. The only BE learning style variable that contributed significantly 
to the listening comprehension score was the BE auditory 
learning style score. This single variable contributed a correlation 
coefficient of R = .31, R² = .10 (SE = 3.28), p < .001. Participants’ 
predicted listening comprehension score is equal to 17.20 
(constant) + −0.30 (BE auditory learning style score), indicating a 
negative relationship between the BE auditory learning style score 
and the listening comprehension aptitude score. The coefficient 
model shows that for every 1 point that the BE auditory learning 
style score decreased, the listening comprehension score increased 
0.30 points. BE visual word learning style score and the difference 
between BE auditory learning style score and BE visual word 
learning style score failed to contribute any significant variance 
beyond that already accounted for by the BE auditory learning 
style score. This analysis demonstrated that only the BE auditory 
learning style score accounted for a significant portion of the 
listening comprehension variance. However, contrary to what 
would be predicted by the meshing hypothesis, this relationship 


Table 1 

Descriptive Statistics and Correlation Matrix for the Predictor Variables Entered into the 
Multiple Regression Aptitude Model for Research Question 1 

Variable                                                               M 
1. Listening aptitude                                                13.87 
2. Reading aptitude                                                  12.81 
3. Difference between listening and reading aptitude                  1.67 
4. BE auditory learning style                                        10.98 
5. BE visual word learning style                                     11.89 
6. Difference between BE auditory and visual word learning styles    −0.92 


Note. n = 121; BE = Building Excellence. 
Table 2 

Coefficients for the Significant Predictor Variables Entered into 
the Multiple Regression Model for Listening Aptitude, Reading 
Aptitude, and Difference Between Listening and Reading 
Aptitude for Research Question 1 

Variable                                  B      SE B      β        R      R² 
Listening aptitude                                                 .31    .10 
  Constant                              17.20    0.97 
  BE auditory learning style score      −0.30    0.084   −.31** 
Reading aptitude                                                   .24    .06 
  Constant                              14.84    0.81 
  BE auditory learning style score      −0.18    0.070   −.24** 
Difference between listening and 
  reading aptitude                        —       —        — 

Note. The following predictor variables were entered into the model for 
(a) listening aptitude, (b) reading aptitude, and (c) difference between 
listening and reading aptitude: Building Excellence (BE) auditory learning 
style score; BE visual word learning style score; and the difference 
between BE auditory and BE visual word score. Only the variables listed 
above made significant contributions to these models. No variables made 
significant contributions to the model for difference between listening and 
reading aptitude. Shown are the coefficients (B), the standard error of the 
coefficients (SE B), as well as the standardized coefficient (β), and the 
correlation. N = 121. 
* p < .05. ** p < .01. 


was negative. That is, as auditory learning style preference in- 
creased, performance on a listening aptitude test decreased. 
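
The reported regression equation for listening aptitude can be written out as a short function. This is a minimal sketch: the function name is hypothetical, and the BE auditory scores passed to it (other than the sample mean of 10.98 from Table 1) are illustrative inputs, not values taken from the study.

```python
def predict_listening_aptitude(be_auditory_score: float) -> float:
    """Predicted L-AT raw score from the reported equation:
    17.20 (constant) + (-0.30) * BE auditory learning style score."""
    constant = 17.20
    slope = -0.30
    return constant + slope * be_auditory_score


# Illustrative BE auditory scores (hypothetical inputs).
for score in (5, 11, 17):
    print(score, round(predict_listening_aptitude(score), 2))

# At the sample mean BE auditory score from Table 1 (10.98), the prediction
# falls close to the sample mean listening aptitude score (13.87).
at_mean = predict_listening_aptitude(10.98)
print(round(at_mean, 2))
```

Each additional point on the BE auditory scale lowers the predicted listening score by 0.30 points, which is the negative relationship described above.
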
Predicting reading comprehension aptitude from learning 
preference scores. The meshing hypothesis would predict that if 
visual word learning style preference equates to reading compre- 
hension aptitude, as participants’ visual word learning style pref- 
erence score increased, their reading comprehension aptitude score 
would also increase. As shown in Table 1, the correlation between 
visual word learning style preference (based on the BE visual word 
score) and reading comprehension (based on the R-AT score) was 
neither positive nor significant (r = −.04). To further test whether other 
learning style variables influence reading comprehension aptitude, 
we calculated a multiple linear regression analysis to determine the 
extent to which participants’ reading comprehension aptitude (R- 
AT) could be predicted based on their BE auditory learning style 
score, BE visual word learning style score, and the difference 
between their BE auditory and BE visual word scores. As seen in 
Table 2, a significant regression equation was found, F(1, 119) = 
7.01, p < .01, with an R² of .06. However, the only BE variable 
that contributed significantly to the reading comprehension score 
was the BE auditory learning style score. This single variable 
contributed a correlation coefficient of R = .24, R² = .06 (SE = 
2.72), p < .01. Participants’ predicted reading comprehension 
score is equal to 14.84 (constant) + −0.18 (BE auditory learning 
style score), indicating a negative relationship between BE audi- 
tory learning style score and reading comprehension aptitude 
score. The coefficient model shows that for every 1 point that the 
BE auditory learning style score decreased, the reading compre- 
hension score increased 0.18 points. Contrary to the assumption 
that a visual verbal learning style preference would predict higher 
reading scores, neither the BE visual word learning style score nor 
the difference between the BE auditory learning style score and BE 
visual word learning style score contributed any significant variance 
beyond that already accounted for by the BE auditory learning 
style score. This analysis demonstrated that auditory learning 
style was also the only significant predictor of reading compre- 
hension scores, and this relationship was again negative. 
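
Because each final model retains a single predictor, the reported F ratios and R values can be cross-checked with the identity F = R² / (1 − R²) × (n − 2). The sketch below assumes that R was rounded to two decimals before reporting and asks whether each reported F falls inside the range implied by that rounding; the function names are hypothetical.

```python
def f_from_r(r: float, n: int) -> float:
    """F ratio with df = (1, n - 2) for a regression with one predictor."""
    r_squared = r ** 2
    return r_squared / (1 - r_squared) * (n - 2)


def consistent_with_rounding(reported_f: float, r_rounded: float, n: int,
                             half_step: float = 0.005) -> bool:
    """Check whether reported_f lies in the F range implied by an R value
    that was rounded to two decimals (assumption about the reporting)."""
    low = f_from_r(r_rounded - half_step, n)
    high = f_from_r(r_rounded + half_step, n)
    return low <= reported_f <= high


# Values reported above: listening (R = .31, F = 12.96), reading (R = .24, F = 7.01).
print(consistent_with_rounding(12.96, 0.31, 121))
print(consistent_with_rounding(7.01, 0.24, 121))
```
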
Predicting the difference between listening comprehension 
aptitude and reading comprehension aptitude from learning 
preference scores. The meshing hypothesis would predict that 
individuals who have a stronger auditory learning style preference 
would also have a higher listening versus reading comprehension 
aptitude score, and conversely, those who have a stronger visual 
word learning style preference would also have a higher reading 
versus listening comprehension aptitude score. A multiple linear 
regression was calculated to determine the extent to which participants’ 
difference between listening comprehension aptitude (L-AT) 
and reading comprehension aptitude (R-AT) could be predicted 
based on their BE auditory learning style score, BE visual 
word learning style score, and the difference between their BE 
auditory and BE visual word scores. This regression analysis most 
completely tests the meshing hypothesis, which not only predicts 
a simple relationship between learning style preference and com- 
prehension aptitude but also and more specifically predicts that 
individuals with different learning styles will perform differen- 
tially with different modes of input. The results of this analysis 
failed to support this prediction. None of the variables (BE audi- 
tory learning style score, BE visual word learning style score, or 
the difference between BE auditory and BE visual word scores) 
contributed significantly to the difference between listening com- 
prehension aptitude and reading comprehension aptitude. 
Discussion of analyses for Research Question 1. Pashler et 
al. (2008) pointed out that learning style preferences and learning 
aptitudes are often considered to be overlapping constructs. After 
all, it seems intuitive that individuals who prefer to listen would 
perform better on a test of listening than reading comprehension 
and, conversely, those who prefer reading would perform better on 
a test of reading than listening comprehension. This relationship is 
referred to as the meshing hypothesis. Research Question 1 was 
designed as an empirical test of this hypothesis, as it pertains to 
verbal comprehension aptitude. Participants completed the BE 
Learning Styles Inventory as well as both a listening and a reading 
comprehension aptitude test. A series of analyses were calculated 
to determine the extent to which auditory and visual word learning 
style variables predicted listening and/or reading comprehension 
aptitude. Both a continuous score (based on the 17-point scale 
established by the BE Learning Style Inventory) and a categorical 
classification of participants as either an auditory learner or visual 
word learner were included in these analyses. This categorical 
classification was based on the 5-point BE scale and included only 
those participants with a strong difference between their auditory 
and visual word reading preference scores. Regardless of whether 
continuous or categorical scores from the BE Learning Styles 
Inventory were used, the results were consistent in failing to 
provide statistically significant support for the meshing hypothesis. 
Contrary to the expectations predicted by the meshing hypothesis, 
that a high visual word learning style score would be the best 
predictor of a high reading comprehension aptitude score, and 
conversely, that a high auditory learning style score would be the 
best predictor of a high listening comprehension aptitude score, 
auditory learning style proved to be the only significant predictor 
of both reading and listening comprehension scores, and in both 
cases this relationship was negative. That is, as individuals’ audi- 
tory learning style preference scores increased, their performance 
on both the listening and reading comprehension aptitude tests 
decreased. Thus, the results using the BE learning style preference 
scores both as a continuous scale as well as a discrete categorical 
measure of auditory and visual word learning style preference fail 
to demonstrate a significant positive relationship between (a) auditory 
learning style preference and listening aptitude, (b) visual 
word learning style preference and reading aptitude, or (c) a 
differential effect of learning style preference on performance on a 
listening compared with a reading comprehension aptitude test. 
These findings fail to support the construct that an individual’s 
learning style (auditory, visual word) is positively correlated with 
their listening and reading aptitude. Taken together, these data fail 
to provide statistical support for the meshing hypothesis, at least as 
it pertains to verbal comprehension (listening vs. reading) apti- 
tudes. 

Limitations of analyses for Research Question 1. One potential 
concern for the analyses reported for Research Question 1 was 
that the L-AT and R-AT were not matched for difficulty. It is 
important to emphasize that the tests developed for this study to 
assess listening and reading comprehension aptitude, while derived 
from two equivalent forms of a standardized, normed reading test 
(GORT-4), were not given in the standard format on which these 
norms were based. The main question of interest is whether there 
was a differential pattern of results for auditory compared with 
visual word learners when listening compared with reading. In this 
study, both the auditory and the visual word learning style pref- 
erence groups scored higher on the listening than on the reading 
comprehension aptitude test. This could be an indication that the 
listening version of the test was easier than the written version. 
While equivalent scores would have been ideal, it is important 
to note that it is the pattern of the results, rather than the 
absolute values, that is critical in addressing the meshing hypoth- 
esis. As shown in Figure 1C, the difference in listening compared 
with reading performance resulted in parallel lines for the auditory 
learners compared with the visual word learners. That is, while 
participants classified by the BE Learning Style Inventory as 
auditory learners did, indeed, score higher on a comprehension test 
when they listened to versus read passages, participants classified 
as visual word learners showed a similar pattern; that is, they also 
scored higher on this same comprehension test when they listened 
to versus when they read the test passages—and to the same 
degree. Taken in context, the results from Research Question 1 are 
contrary to the pattern that would be expected based on the 
meshing hypothesis, at least as it applies to tests of listening and 
reading comprehension aptitude. However, this research question 
does not address the issue of whether learning and retention of 
“real-world,” nonfiction material, presented using different in- 
structional methods, is affected by an individual’s preferred learn- 
ing style. This was the focus of Research Question 2. 


Analyses for Research Question 2 


Research Question 2 addresses the extent to which learning style 
preferences (as measured by the BE Learning Style Inventory) 
and/or learning aptitudes (as measured by the L-AT and R-AT) 
predict how much an individual comprehends and retains based on 
mode of instruction (audiobook, e-text) as measured by the Unbroken 
comprehension test. 

The validity of the Unbroken comprehension test was evaluated 
to assure that results obtained from this test were an accurate 
measure of comprehension. To do this, the comprehension score of 
each participant on the Unbroken comprehension test at Time 1 
was compared with the same participant’s total comprehension 
aptitude score (total verbal comprehension aptitude = L-AT + 
R-AT). A Pearson correlation coefficient was calculated. A pos- 
itive correlation was found, r(119) = 0.59, p < .001, indicating 
that there was a significant relationship between participants’ 
scores on the total verbal comprehension aptitude test and partic- 
ipants’ scores on the Unbroken comprehension test at Time 1. This 
analysis showed that participants who had higher comprehension 
scores as indicated by the total verbal comprehension aptitude test 
also had higher comprehension scores on the Unbroken test at 
Time 1, providing construct validity for this test. Next, a Pearson 
correlation coefficient was calculated for the relationship between 
the Unbroken comprehension test at Time 1 and Time 2. A strong 
positive correlation was found, r(118) = 0.86, p < .01, indicating 
a significant linear relationship between the two variables. Partic- 
ipants who performed well on the Unbroken comprehension test at 
Time 1 performed well on this same test at Time 2. This linear 
relationship indicates strong test-retest reliability for the Unbroken 
comprehension test. 
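
For readers who wish to see how such a coefficient is obtained, the following sketch computes a Pearson correlation in plain Python and converts a correlation into its t statistic with df = n − 2. The six score pairs are hypothetical and are not data from this study; only the final call uses a value reported above (r = .59, df = 119).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def t_for_r(r: float, df: int) -> float:
    """t statistic for testing a correlation against zero, df = n - 2."""
    return r * math.sqrt(df / (1 - r ** 2))


# Hypothetical scores for six participants (illustration only).
aptitude_total = [22, 25, 27, 30, 31, 35]
unbroken_time1 = [28, 27, 33, 34, 32, 39]
r_example = pearson_r(aptitude_total, unbroken_time1)
print(round(r_example, 2))

# t statistic implied by the reported validity coefficient, r(119) = .59.
print(round(t_for_r(0.59, 119), 2))
```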

A 2 (modes of instruction) × 2 (time) mixed-design ANOVA 
was calculated to evaluate the effects of mode of instruction 
(audiobook, e-text) and time (Time 1, Time 2) on the Unbroken 
comprehension test scores. There was a main effect of time (Time 
1 vs. Time 2), F(1, 58) = 37.3; p < .05. However, there was no 
significant main effect for mode of instruction, F(1, 58) = 0.25; 
p > .05. In addition, there was not a significant mode of instruction 
by time interaction, F(1, 58) = 0.08; p > .05. These results 
indicate that there was no difference in difficulty on the Unbroken 
comprehension test when presented by audiobook versus e-text. 
Moreover, all participants performed better in both instructional 
conditions at Time 1 than Time 2. 

Analyses using categorical learning style variables to predict 
learning via audiobook versus e-text mode of instruction at 
Time 1: The Pashler et al. (2008) method. When the meshing 
hypothesis is applied to education theory and practice, it is as- 
sumed that learning will be more effective when material is pre- 
sented in an instructional mode that meshes with the individual’s 
preferred learning style. Pashler et al.’s (2008) meshing hypothesis 
pertaining to learning style preference and modes of instruction 
predicts that individuals with a visual learning style preference will 
comprehend better when they read rather than listen, and con- 
versely, individuals with an auditory learning style preference will 
comprehend better when they listen rather than read. The Pashler 
et al. (2008) roadmap for evaluating the meshing hypothesis em- 
pirically begins by dividing participants into distinct learning style 
preference groups. Therefore, for this analysis, participants were 
classified as auditory or visual word learners based on their BE 
Learning Styles Inventory results, as described in the Methods 
section. 

Results from Research Question 1 showed that there were 
significant differences in reading and listening aptitude for study 
participants based on their learning style preference. Recall that 
participants with a visual word learning style preference achieved 
both higher listening and reading aptitude scores than participants 
in the auditory learning style preference group. As a result, to 
control for any effect of potential differences in total verbal com- 
prehension aptitude, based on the random assignment to instruc- 
tional condition in Research Question 2, we conducted all analyses 
both with and without co-varying out the effect of total reading and 
listening aptitude. No significant differences were found with or 
without the covariance. As such, only the ANOVA results are 
reported. Table 3 shows the Unbroken comprehension test raw 
scores (total number correct out of 48) by learning style preference 
group (auditory, visual word) and instructional condition (audio- 
book, e-text) at Time 1 and Time 2. A between-subjects 2 (learning 
style preference) × 2 (mode of instruction) ANOVA was performed 
using these data to examine the effect of different learning 
style preferences (auditory, visual word) and different modes of 
instruction (audiobook, e-text) on the Unbroken comprehension 
test scores at Time 1. The results of this analysis showed that the 
main effect for learning style preference was significant, F(1, 
37) = 6.11; p < .05, indicating a significant difference between 
participants with an auditory learning style preference (M = 30.57; 
SD = 5.89), and those with a visual word learning style (M = 
34.40; SD = 3.33). This demonstrates that Unbroken comprehen- 
sion at Time 1 was affected by learning style preference, with the 
participants with visual word learning styles performing better. 
However, the main effect for instructional condition (audiobook, 
e-text) was not significant, F(1, 37) = .15; p > .05, with no 
significant difference in performance on the Unbroken compre- 
hension test at Time 1 between participants in the audiobook 
condition (M = 32.10; SD = 6.00) and those in the e-text condition 
(M = 32.80; SD = 4.16). Finally, the interaction between 
instructional condition (audiobook, e-text) and learning style pref- 
erence (auditory, visual word) was not significant, F(1, 37) = 
0.42; p > .05, indicating that providing instruction in a mode that 
matched an individual’s learning style preference did not result in 
significantly better learning. Figure 2C shows the results of this 
analysis. 

According to Pashler et al. (2008), acceptable evidence in sup- 
port of the meshing hypothesis would show a crossover between 
the two learning style preference groups and two modes of instruc- 
tion, as shown in Figure 2A. Figure 2B shows an example from 
Pashler et al. (2008) of unacceptable evidence for the meshing 
hypothesis, where both auditory and visual word learning style 
preference groups score higher on the same method of instruction, 
and hence there is no crossover. As seen in Figure 2C, contrary to 


Table 3 

Unbroken Comprehension Test Raw Scores (Total Number 
Correct Out of 48) by Learning Style Preference Group and 
Instructional Condition at Time 1 and Time 2 

Figure 2. Examples of (A) acceptable evidence and (B) unacceptable 
evidence for the meshing hypothesis (according to Pashler et al., 2008). 
Graph C displays the results from this study and corresponds to one of 
Pashler et al.’s (2008) examples of unacceptable evidence. Error bars 
represent standard errors. The 95% confidence interval (CI) for the audio- 
book condition ranged from 26.82 to 32.81 for participants with an audi- 
tory learning preference and from 31.46 to 37.74 for participants with a 
visual learning preference. The 95% CI for the e-text group ranged from 
28.26 to 34.54 for participants with an auditory learning preference and 
from 31.06 to 37.34 for participants with a visual learning preference. 


the crossover pattern that would be expected to support the mesh- 
ing hypothesis, results from the current study show there is min- 
imal difference based on instructional condition for participants in 
either the auditory or visual word learning style preference groups. 
According to Pashler et al. (2008), this pattern corresponds with 
one example of unacceptable evidence in support of the meshing 
hypothesis. 

Two-week retention (Time 2). It is possible that presenting 
instruction in a mode that meshes with an individual’s learning 
style may affect longer term retention of information. To address 
this possibility, we calculated a 2 (learning style preference) × 2 
(mode of instruction) ANOVA to examine the long-term (2-week) 
effect of instructional condition (audiobook, e-text) and learning 
style preference (auditory, visual word) on Unbroken comprehen- 
sion test scores at Time 2. The results at Time 2 parallel those 
found at Time 1. That is, the main effect for learning style 
preference was significant, F(1, 37) = 9.18; p < .05, indicating 
that Unbroken comprehension test scores at Time 2 were affected 
by learning style preference. Participants with an auditory learning 
style preference performed significantly more poorly (M = 27.33; 
SD = 5.74) than those with a visual word learning style preference 
(M = 32.25; SD = 4.23). The main effect of group was not 
significant, F(1, 37) = .03; p > .05, with no significant difference 
between participants using an audiobook (M = 29.52; SD = 6.55), 
and those using e-text (M = 29.95; SD = 4.51). Finally, the 
interaction between instructional condition and learning style pref- 
erence was not significant, F(1, 37) = 0.54; p > .05, indicating 
that providing instruction in a mode that matched an individual’s 
learning style preference did not result in significantly better 
retention. 

There were no significant interactions between learning style 
preference and mode of instruction based on a categorical classi- 
fication of participants into two discrete groups, those with an 
auditory and those with a visual word learning style. However, 
classification of participants into two discrete groups reduced the 
sensitivity of continuous variables and also reduced the sample 
size by including only those participants who had a clear auditory 
or visual word learning style preference. To mitigate these con- 
cerns, we conducted a final series of correlation and stepwise 
multiple regression analyses to evaluate whether there was a 
significant relationship among learning style preference (BE 
Learning Styles Inventory), learning aptitude (L-AT, R-AT), and 
mode of instruction (digital audiobook, e-text). For these analyses, 
variables from the BE Learning Styles Inventory based on the 
continuous 17-point standard BE scoring system and verbal apti- 
tude scores based on the L-AT and R-AT were used to predict (a) 
learning outcomes from the audiobook mode of instruction and (b) 
learning outcomes from the e-text mode of instruction (Table 4). 

Correlation and regression analyses for Research Question 2. 

Predicting audiobook learning outcomes from learning style 
preference and verbal aptitude scores at Time 1. The meshing 
hypothesis predicts a positive correlation between learning style 
preference and instructional mode. That is, as auditory learning 
style preference scores increase, learning outcomes via the audio- 
book mode of instruction, but not via the e-text mode of instruc- 
tion, would also increase. As seen in Table 4, counter to what 
would be predicted by the meshing hypothesis, when the Pearson 
correlation was calculated examining the relationship between 
auditory learning style preference and learning from an audiobook, 
a weak negative correlation that was not significant was found, 
r(28) = −.30, p > .05. When the Pearson correlation was calculated 
examining the relationship between visual word learning 


Table 4 


style preference and learning from an audiobook, the results were 
similar; a weak negative correlation that was not significant was 
found, r(28) = −.24, p > .05. 

Similarly, to further test whether any other learning style or 
aptitude variables influenced learning outcomes from the audio- 
book mode, a multiple linear regression was calculated to deter- 
mine the extent to which participants’ learning of nonfiction ma- 
terial presented in audiobook format (instructional condition) 
could be predicted based on their learning style preference (BE 
auditory, BE visual word, and the difference between BE auditory 
and BE visual word scores) as well as comprehension aptitude 
(R-AT, L-AT, and verbal comprehension aptitude difference). As 
seen in Table 5, a significant regression equation was found, F(1, 
28) = 16.18, p < .001. Listening aptitude (L-AT) was the only 
variable that contributed significantly to the comprehension of the 
material presented via the audiobook condition. This single variable 
contributed a correlation coefficient of R = .61, R² = .37 (SE = 
5.43), p < .001. The regression equation showed that comprehension 
of material in the audiobook instructional condition was equal 
to 15.71 (constant) + 1.15 (listening comprehension aptitude), 
indicating a positive relationship between listening aptitude and 
learning from auditory instruction. The coefficient model shows that 
for every 1 point that the listening aptitude score increased, the 
predicted audiobook comprehension score increased 1.15 points. This 
analysis demonstrated that 
only a component of aptitude (L-AT) contributed significantly to 
the variance in learning based on auditory instruction. BE auditory 
learning style score, BE visual word learning style score, and the 
difference between BE auditory and BE visual word score, as well 
as reading aptitude (R-AT), and the difference between listening 
aptitude and reading aptitude (L-AT—R-AT), failed to contribute 
any significant variance beyond that already accounted for by the 
listening aptitude score. 

Predicting e-text learning outcomes from learning style preference 
and verbal aptitude scores at Time 1. The meshing 
hypothesis predicts a positive correlation between learning style 
preference and instructional mode. That is, as visual word learning 
style preference scores increase, learning outcomes via the e-text 


Descriptive Statistics and Correlation Matrix for the Predictor Variables Entered into the Multiple Regression Model for the 
Audiobook and e-Text Instructional Conditions at Time I as Described in Research Question 2 
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Table 5 

Coefficients for the Significant Predictor Variables Entered into 
the Multiple Regression Model for Audiobook and E-Text 
Learning at Time 1 as Described in Research Question 2 





Variable B SEB B R R? N 

Audiobook 

Learning 61 OT 30 

Constant Lal 3.89 ibe 

Listening aptitude Teds 0.29 Ole 
E-text 

Learning .70 49 31 

Constant 16.80 2.86 a 

Listening aptitude 1.01 0.19 10 


Note. The following predictor variables were entered into the model for 
audiobook and e-text learning: listening aptitude; reading aptitude; differ- 
ence between listening and reading aptitude; Building Excellence (BE) 
auditory learning style score; BE visual word learning style score, and the 
difference between BE auditory and BE visual word score. Only the 
variables listed above made significant contributions to the model. Shown 
are the coefficients (B), the standard error of the coefficients (SEB), as well 
as standardized coefficient (8), and the correlation. N = 30 (audiobook); 
N = 31 (e-text). 
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mode of instruction, but not via the audiobook mode of instruction, 
would also increase. As shown in Table 4, when the Pearson 
correlation coefficients were calculated, there was a weak positive 
correlation between visual word learning style preference and 
learning from e-text, r(29) = .05, p > .05, and a weak negative 
correlation between auditory word learning style preference and 
learning from e-text (7(29) = -.24, p > .0S. However, neither 
correlation was significant. 
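
The two nonsignificant correlations reported for e-text learning can be checked by converting each r to a t statistic with df = N - 2. The sketch below is illustrative only: the function name is hypothetical, and the critical value of 2.045 (two-tailed, alpha = .05, df = 29) is taken from a standard t table rather than from the article.

```python
import math

# Two-tailed critical t for alpha = .05 and df = 29 (standard t-table value,
# not reported in the article).
T_CRITICAL_DF29 = 2.045


def t_from_r(r: float, df: int) -> float:
    """Convert a Pearson correlation to its t statistic, where df = N - 2."""
    return r * math.sqrt(df / (1.0 - r * r))


# Correlations with e-text learning reported in the text, r(29).
for label, r in (("visual word", 0.05), ("auditory", -0.24)):
    t = t_from_r(r, df=29)
    print(label, round(t, 2), "significant:", abs(t) > T_CRITICAL_DF29)
```

Both statistics fall below the critical value in absolute terms, which agrees with the reported p > .05 for each correlation.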

To further test whether any learning style or aptitude variables 
influence learning outcomes from the e-text mode of instruction, a 
multiple linear regression was calculated to determine the extent to 
which participants’ learning of nonfiction material presented in 
e-text format (instructional condition) could be predicted based on 
their learning style preference (BE auditory, BE visual word, and 
the difference between BE auditory and BE visual word scores) as 
well as comprehension aptitude (R-AT, L-AT, and verbal
comprehension aptitude difference). As shown in Table 5, a significant
regression equation was found, F(1, 29) = 27.67, p < .001, with an
R² of .49. Contrary to what would be predicted by the meshing
hypothesis, however, the only variable that contributed
significantly to the learning of the material presented in the e-text
condition was listening comprehension aptitude (L-AT). This
single variable contributed a correlation coefficient of R = .70,
R² = .49 (SE = 3.93), p < .001. The regression equation showed that
comprehension of material in the e-text instructional condition was
equal to 16.80 (constant) + 1.01 (L-AT), indicating a positive
relationship between listening comprehension aptitude and
learning from e-text instruction. The coefficient model shows that for
every 1-point increase in listening comprehension aptitude, predicted
e-text learning increased by 1.01 points. This analysis demonstrated that
only listening comprehension aptitude (L-AT) contributed signif- 
icantly to the variance in learning material presented via e-text 
instruction. BE auditory learning style score, BE visual word 
learning style score, and the difference between BE auditory and 
BE visual word score, as well as R-AT, total verbal comprehension
aptitude, and the verbal comprehension aptitude difference
failed to contribute any significant variance in learning beyond that
already accounted for by the L-AT.
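
The reported F ratios and R² values for the two single-predictor models can be cross-checked against each other, because F = (R²/k) / ((1 - R²)/(N - k - 1)) for a model with k predictors. The sketch below is a consistency check under that standard identity, not a reanalysis; the function names are hypothetical, and the inputs are the rounded values reported in the text.

```python
def f_from_r2(r2: float, n: int, k: int = 1) -> float:
    """F ratio implied by R-squared for a regression with k predictors."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def r2_from_f(f: float, n: int, k: int = 1) -> float:
    """R-squared implied by an F ratio for a regression with k predictors."""
    df_residual = n - k - 1
    return (f * k) / (f * k + df_residual)


# Values reported in the text: (condition, F, N).
for label, f_reported, n in (("audiobook", 16.18, 30), ("e-text", 27.67, 31)):
    print(label, round(r2_from_f(f_reported, n), 2))
```

The R² values recovered from the reported F ratios round to .37 (audiobook) and .49 (e-text), matching the text. Going the other way from the rounded R² values gives F ratios slightly above those reported, as would be expected from rounding.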

Predicting audiobook and e-text learning outcomes from 
learning style preference scores only at Time 1. When both 
verbal comprehension aptitude and learning style preference vari- 
ables were entered into multiple regression analyses to predict 
learning via either audiobook or e-text modes of instruction, only 
aptitude measures proved to significantly predict learning out- 
comes. In a final attempt to find a significant relationship between 
learning style preference and effects of instructional mode on 
learning, we conducted a regression analysis using only learning 
style preference variables (BE auditory, BE visual word, and the 
difference between BE auditory and BE visual word scores) to 
predict (a) audiobook and (b) e-text learning outcomes. The results 
of these analyses failed to provide any statistically significant 
support for the meshing hypothesis in that none of the BE learning 
style preference variables accounted for a statistically significant 
amount of variance for either audiobook (p > .05) or e-text (p > 
.05) learning outcomes. 

Predicting audiobook and e-text learning outcomes from 
learning style preference and verbal aptitude scores at Time 2. 
Even if learning style preferences do not affect immediate 
learning of material based on mode of instruction (audiobook, 
e-text), it is possible that presenting instruction in a mode that 
meshes with an individual’s learning style may affect longer 
term retention of information. Just as was done using the 
categorical variables for learning style preference, all analyses 
were repeated using the continuous variables based on the 
2-week retention data obtained at Time 2. The results of these 
analyses are shown in Tables 6 and 7. As can be seen by directly 
comparing the correlation matrices obtained at Time 1 (Table 4) 
with those obtained at Time 2 (Table 6), as well as the multiple 
regression models obtained at Time 1 (Table 5) with those 
obtained at Time 2 (Table 7), the results were very similar at 
Time 2 to those found at Time 1. The only significant correla- 
tion found between audiobook learning and auditory learning 
style preference was found at Time 2, and this correlation was 
negative (r = -.39, p < .05). Similarly, results from the stepwise
multiple regression analyses were similar at Time 2 to those 
found at Time 1; only aptitude scores positively predicted 
audiobook and e-text learning, with no significant learning 
preference variables entering the model (Table 7). Thus, similar 
to the results pertaining to immediate learning obtained at Time 
1, the results obtained at Time 2 failed to provide any statisti- 
cally significant evidence that showed that providing individu- 
als with instruction in a mode that meshes with their learning 
style preference results in significantly better long-term
retention of information.

Discussion of analyses for Research Question 2. Research 
Question 2 investigated the meshing hypothesis as it pertains to 
mode of instruction. Specifically, the meshing hypothesis predicts 
that participants with an auditory learning style preference will 
learn material better when instruction is presented via a listening 
mode than when it is presented via a written mode and, conversely, 
those with a visual word learning style preference will learn 
material better after having read it rather than having listened to it. 
An ANOVA was calculated to determine if the experiment
provided any statistically significant evidence that showed that the
method most effective for instructing individuals with one learning
style is not the most effective method for individuals with a
different learning style. The results of these analyses failed to
provide empirical support for the meshing hypothesis. No
significant interactions were found between learning style preference
(auditory, visual word) and instructional method (digital
audiobook, e-text) for either immediate learning or 2-week retention of
verbal information.


Table 6

Descriptive Statistics and Correlation Matrix for the Predictor Variables Entered into the Multiple Regression Model for Audiobook
and E-Text Learning at Time 2 as Described in Research Question 2

A second series of simple and multiple regression analyses were
conducted using continuous variables of both learning style
preference as well as verbal comprehension aptitude. When both
learning style and verbal comprehension aptitude variables were
pitted against each other in multiple regressions to predict learning
via either digital audiobook or e-text, only the aptitude variables
accounted for a significant amount of variance in learning. When
only learning style variables were entered into these multiple
regression analyses, they failed to account for a significant amount
of variance in learning. Thus, regardless of scoring method used
(categorical or continuous), the results from Research Question 2
failed to find a significant interaction between learning style
preferences (auditory, visual word) and instructional method (digital
audiobook, e-text) on learning or retention of information from a
nonfiction text.


Table 7

Coefficients for the Significant Predictor Variables Entered into
the Multiple Regression Model for Learning From Audiobook
and E-Text at Time 2, as Described in Research Question 2

Variable                  B             SE B          β             R             R²            N

Audiobook
  Comprehension                                                     [illegible]   [illegible]   30
  Constant                2.65          4.41
  Listening aptitude      1.21          0.31          .55*
  Reading aptitude        0.78          0.34          [illegible]
E-text
  Comprehension                                                     [illegible]   [illegible]   30
  Constant                [illegible]   [illegible]
  Listening aptitude      [illegible]   [illegible]   [illegible]

Note. The following predictor variables were entered into the model for
audiobook and e-text learning: listening aptitude; reading aptitude;
difference between listening and reading aptitude; Building Excellence (BE)
auditory learning style score; BE visual word learning style score; and the
difference between BE auditory and BE visual word score. Only the
variables listed above made significant contributions to the model. Shown
are the coefficients (B), the standard error of the coefficients (SE B), as well
as standardized coefficient (β), and the correlation. N = 30 (audiobook);
N = 30 (e-text).

*p < .05. **p < .01.


General Discussion


According to Pashler et al.’s (2008) recent review of the learning
styles literature, there is widespread belief among educators
and the general public alike that individuals learn better when they
are presented instruction in the modality that capitalizes on their
learning style preference. Pashler et al. (2008) focused on the
extent to which auditory and visual learning style preferences
influence verbal comprehension. Specifically, they focused on the
meshing hypothesis that proposes that individuals with a visual
learning style preference will learn more when information is
presented to them in a written format, and conversely, those with
an auditory learning style preference will learn more when
instruction is presented to them in a listening format. They also pointed
out that the meshing hypothesis may have led to the belief that
learning style preferences and learning aptitudes for verbal
comprehension are similar constructs. Their review of the literature led
them to conclude, however, that there is little empirical evidence to
support a direct relationship between learning style preferences
(auditory, visual) and either (a) verbal comprehension aptitude
(listening vs. reading) or (b) differential learning outcomes based
on different modes of instruction (e.g., audiobook vs. e-text).
However, they also concluded that the definitive study had not
been conducted, and therefore, they prescribed a detailed roadmap
for the experimental methodology that would be needed to address
these important issues empirically as well as explicit examples of
the patterns of data that would either support or refute the meshing
hypothesis.

We conducted an investigation of the meshing hypothesis with 
college-educated adults following the research methods laid out by 
Pashler et al. (2008) to address two research questions. In Research 
Question 1, we used these methods to assess the extent to which an 
individual’s learning style preference (auditory, visual word) was 
consistent with his or her learning aptitude for verbal comprehen- 
sion (listening, reading). In Research Question 2, we used these 
methods to assess the extent to which an individual’s learning style 
preference (auditory, visual word) differentially affected how
much they would learn and retain from nonfiction text presented 
using two different modes of instruction (digital audiobook, 
e-text). 

Results from Research Question 1 showed that differences in 
preferred learning style (auditory, visual word) were not found to 
significantly predict differences in learning aptitude (listening vs. 
reading comprehension). That is, there were no statistically signif- 
icant results that showed that individuals with stronger auditory 
learning style preferences had higher listening comprehension 
aptitude than reading aptitude or, conversely, that individuals with 
stronger visual word learning style preferences had higher reading 
than listening aptitude. Instead, participants classified with a pre- 
ferred visual word learning style outperformed those classified as 
having a preferred auditory learning style on both the listening and 
reading comprehension aptitude tests. These results show that 
learning style preference and aptitude are not comparable con- 
structs. Thus, the results from Research Question 1 failed to 
provide statistically significant support for the meshing hypothesis, 
at least as it pertains to the relationship between learning style 
preference (auditory, visual word) and verbal comprehension ap- 
titude (listening, reading), respectively. 

Similar to the results from Research Question 1, the results from 
Research Question 2 also failed to provide statistically significant 
empirical evidence supporting the meshing hypothesis, either for 
immediate learning or long-term retention of information pre- 
sented via two different modes of instruction (audiobook, e-text). 
Regardless of whether categorical or continuous measures of 
learning styles were used or which method of analysis (ANOVA, 
simple correlations, multiple regression analyses) was chosen, 
there were no significant findings that showed that providing 
instruction to individuals in a mode that meshed with their pre- 
ferred learning style resulted in better learning or retention of 
information compared with instructing them in their nonpreferred 
mode. 

In conclusion, at least for verbal comprehension, no statistically 
significant evidence was found in this investigation to support the 
claims (a) that learning style is equivalent to learning aptitude or
(b) that providing instruction in the modality that meshes with an 
individual’s preferred learning style will result in significantly 
better learning or retention than presenting the same instruction in 
an individual’s nonpreferred learning style. 


Overall Limitations of the Study 


One potential limitation in interpreting the results of Research 
Question 1 was that the L-AT and R-AT proved not to be matched 
for difficulty. Both the auditory and the visual word learning style 
preference groups scored higher on the listening than on the
reading comprehension aptitude test. This could be an indication
that the listening version of the test was easier than the written 
version. We pointed out that while equivalent scores on the L-AT
and the R-AT would have been ideal, it is the pattern of the
results, rather than the absolute values, that is critical in addressing 
the meshing hypothesis. That is, the main question of interest is 
whether there is a differential pattern of results for participants 
with an auditory compared with a visual word learning style 
preference in respect to listening compared with reading aptitude, 
and the analyses showed that there was not. The results from 
Research Question 2 also addressed this issue. Recall that in this 
case there was no significant main effect of condition (audiobook, 
e-text) on performance on the Unbroken comprehension test. 
There was also no significant interaction found between learning 
style preference and instructional condition. This provides further 
evidence that the failure to find significant support for the meshing 
hypothesis in Research Question 1 was not likely due to differ- 
ences in listening versus reading test difficulty. 

A second limitation of the study discussed for Research Ques- 
tion 1 pertained to the fact that regardless of mode of instruction, 
comprehension was assessed using written questions only. The 
same limitation also applies to Research Question 2. We consid- 
ered that holding the format of the assessment constant would 
allow only one variable (in this case, mode of instruction) to be 
varied within the study. A written format was chosen over a 
listening format because this is consistent with how most tests are 
given. However, it could be argued that using the same (written) 
format for the assessment of learning may have favored those 
individuals who had a stronger visual learning style preference 
and, thus, masked evidence supporting the meshing hypothesis. 
Indeed, it was found in both Research Questions 1 and 2 that 
participants with a visual word learning style preference performed 
significantly better than those with an auditory learning style 
preference on both the listening and reading comprehension tests, 
both of which were assessed by written questions. However, it also
should be recalled that both learning style preference groups 
performed better on the listening aptitude test than the reading 
aptitude test in Research Question 1, even though both were 
assessed with written questions. Regardless of these potential 
limitations to the study design, it should be kept in mind that the 
critical test of the meshing hypothesis rests in finding a significant 
interaction between learning style preference and either aptitude 
(Research Question 1) or learning based on mode of instruction 
(Research Question 2). This was not found in either case. None- 
theless, it may be important in future studies to determine if the 
meshing hypothesis may be supported if both modes of instruction 
as well as assessment measures of aptitude or learning are given in 
both a listening and a written format. 
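
The interaction described above as the critical test can be pictured as a difference of differences across the four cell means. The sketch below uses entirely hypothetical cell means (none are taken from this study) to contrast a crossover pattern that would support the meshing hypothesis with a pattern showing only a main effect of learning style group. It illustrates the shape of the data pattern only; an actual test would be the ANOVA interaction term, which also accounts for within-cell variability.

```python
def interaction_contrast(means: dict) -> float:
    """Difference of differences: (auditory gain from audiobook over e-text)
    minus (visual word gain from audiobook over e-text)."""
    auditory_gain = means["auditory"]["audiobook"] - means["auditory"]["e-text"]
    visual_gain = means["visual word"]["audiobook"] - means["visual word"]["e-text"]
    return auditory_gain - visual_gain


# Hypothetical cell means, for illustration only.
meshing_pattern = {
    "auditory": {"audiobook": 30.0, "e-text": 24.0},
    "visual word": {"audiobook": 24.0, "e-text": 30.0},
}
main_effect_only = {
    "auditory": {"audiobook": 24.0, "e-text": 24.0},
    "visual word": {"audiobook": 29.0, "e-text": 29.0},
}

print(interaction_contrast(meshing_pattern))   # crossover: nonzero contrast
print(interaction_contrast(main_effect_only))  # parallel groups: zero contrast
```

In the second pattern one group outperforms the other in both modes, yet the contrast is zero, which is why a group difference alone does not count as support for the meshing hypothesis.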

Participants in this study were college-educated adults, and 
therefore, the results can only be generalized to similar populations 
with well-developed listening and reading comprehension skills. It 
will be particularly important for future research to repeat this 
same study with children of different ages who are in the process
of developing reading skills to determine the extent to which mode 
of instruction, learning aptitude, and learning style preference may 
affect individual differences in learning outcomes at different 
stages during the development of language and literacy skills. It 
would also be important to determine longitudinally the extent to 
which mode of instruction or learning styles influence literacy 
outcomes when instruction is provided over a longer period of
time. 

This research focused narrowly on verbal comprehension skills 
and the extent to which learning differed when instruction is 
presented via an audiobook compared with e-text. While there are 
many different schemes for classifying individual learning styles, 
we used only one learning style inventory (the Rundle and Dunn 
Building Excellence Inventory) and within that inventory focused 
only on auditory and visual word learning styles. Thus, the degree 
to which the results of this study generalize to other disciplines or 
other learning styles cannot be established by this study. Further- 
more, instruction used in this study was given only one time and 
relied on participants learning information from the preface and 
one chapter in a nonfiction book. The extent to which the results 
of this study can be generalized to other forms of instruction, 
longer durations of instruction, and other types of material cannot 
be established. 

In Research Question 2, the sample size was substantially re- 
duced by the random placement of participants into different 
instructional conditions (audiobook, e-text) and because of the 
categorical analyses. Therefore, for several of the analyses con- 
cerned with finding relations among individual differences in 
learning style preferences or aptitudes and mode of instruction, the 
lack of statistical significance may be influenced by a lack of 
power due to a modest sample size. Nonetheless, when the results 
from both Research Question 1, which included a much larger 
sample size (N = 121), and Research Question 2 (n = 61) are
considered in their entirety, they consistently fail to provide any 
empirical evidence that suggests individuals will learn signifi- 
cantly better when they are provided instruction in a mode that 
meshes with their preferred or stronger learning style than in a 
mode that does not. 
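
To give a sense of what a modest sample size means here, the sketch below computes the smallest correlation that would reach two-tailed significance at alpha = .05 for the per-condition sample sizes used in the Research Question 2 regressions (N = 30 and N = 31). It is an illustration rather than a formal power analysis: the function name is hypothetical, and the critical t values are standard t-table entries, not figures from the article.

```python
import math

# Two-tailed critical t values for alpha = .05 (standard t-table values),
# keyed by degrees of freedom (df = N - 2).
T_CRITICAL = {28: 2.048, 29: 2.045}


def critical_r(n: int) -> float:
    """Smallest |r| that is significant at alpha = .05 (two-tailed) for sample size n."""
    df = n - 2
    t = T_CRITICAL[df]
    return t / math.sqrt(t * t + df)


for n in (30, 31):
    print(n, round(critical_r(n), 3))
```

With roughly 30 participants per condition, only correlations of about .36 or larger in absolute value reach significance, so smaller true relationships could go undetected in the per-condition analyses.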


Conclusion 


The American education system as well as the general public 
has come to believe that optimal learning occurs if individuals are 
taught in their preferred learning style. Dekker et al. (2012) sur- 
veyed 242 primary and secondary school teachers from the United 
Kingdom (n = 137) and the Netherlands (n = 105) who were 
enthusiastic about applying neuroscientific findings to their
instruction. It was assumed that this population, given their high
level of interest, would be current on effective research-based 
practices. The participants were given statements and were asked 
if the statements were “correct,” “incorrect,” or “do not know.” 
Results showed that 93% of teachers from the United Kingdom 
and 96% of teachers from the Netherlands answered “correct” to 
the statement: “Individuals learn better when they receive infor- 
mation in their preferred learning style (e.g., auditory, visual, 
kinesthetic).” The results of that survey demonstrate how pervasive
misinformation about learning styles is in everyday classroom
practice around the world.

The idea of teaching to an individual’s learning style is attrac- 
tive. According to learning styles theory, if an individual is strug- 
gling to learn new material, it is possible that his or her poor 
performance results from not being taught in a mode that meshes 
with the individual’s preferred learning style. Thus, educators and 
professional development leaders spend time and resources
assessing their students’ learning style and developing instruction to
specifically match a student’s preferred learning styles. It is
mon for lesson plans to include a section in which teachers are 
asked to explain how they will accommodate the different learning 
styles of students in their classroom. Therefore, the findings from 
this study have considerable relevance for educational theory and 
practice. 

The main finding from Research Questions 1 and 2 that may 
have a substantial influence on current educational practice is that 
when participants were categorized by their preferred learning 
style, either auditory or visual word, those who were classified as 
visual word learners performed better, compared with auditory 
learners, on verbal comprehension measures. In other words, vi- 
sual word learners scored higher than auditory learners on both the 
reading and the listening aptitude tests and the Unbroken compre- 
hension tests. Therefore, and counter to current educational beliefs 
and practices, educators may actually be doing a disservice to 
auditory learners by continually accommodating their auditory 
learning style preference by providing them instruction that 
meshes with their auditory learning style, rather than focusing on 
strengthening their visual word skills. It is important to keep in 
mind that most testing, from state standardized education assess- 
ments to college admission tests, is presented in a written word 
format only. Thus, it is important to give students as much
experience with written material as possible to help them build these
skills, regardless of their preferred learning style. Rather than 
continually accommodating auditory learners’ preference with in- 
creased instruction in an auditory format, auditory learners might 
benefit more from receiving instruction that specifically targets 
and strengthens their visual word skills. 

In a review of the learning styles literature, Pashler et al. (2008) 
did not find empirical support to justify matching instruction to 
learning style. Pashler and his collaborators brought to light several
pressing concerns. First, too often individuals allow their intuitions
to shape their beliefs. We base our educational practice on trial and
error, or we are complicit in doing what has always been
done. Changing the minds of teachers and teacher educators with
regard to learning styles is no small feat. Pashler et al. (2008)
stated, “If education is to be transformed into an evidence-based 
field, it is important not only to identify teaching techniques that 
have experimental support but also to identify widely held beliefs 
that affect the choices made by educational practitioners that lack 
empirical support” (p. 117). The goal of this study was to provide 
more empirical evidence to guide educational practitioners in 
making sound judgments pertaining to whether their students will 
or will not benefit from receiving instruction that meshes with their 
preferred learning style or aptitude. In the current study, we failed 
to find any statistically significant, empirical support for tailoring 
instructional methods to an individual’s learning style. 


We had 3 aims in the present study: (a) to examine the dimensionality of various evaluative approaches
to scoring writing samples (e.g., quality, productivity, and curriculum-based measurement [CBM] writing
scoring), (b) to investigate unique language and cognitive predictors of the identified dimensions, and (c)
to examine the gender gap in the identified dimensions of writing. These questions were addressed using data
from 2nd- and 3rd-grade students (N = 494). Data were analyzed using confirmatory factor analysis and
multilevel modeling. Results showed that writing quality, productivity, and CBM scoring were disso- 
ciable constructs but that writing quality and CBM scoring were highly related (r = .82). Language and 
cognitive predictors differed among the writing outcomes. Boys had lower writing scores than girls even 
after accounting for language, reading, attention, spelling, handwriting automaticity, and rapid automa- 
tized naming. Results are discussed in light of writing evaluation and a developmental model of writing. 


Keywords: dimensionality, writing quality, writing productivity, CBM, gender 


Students’ writing skill is assessed in multiple ways. To assess a
discourse-level writing skill (e.g., ability to write in paragraphs),
students are typically asked to produce written compositions, and
these compositions are evaluated using multiple approaches such
as writing quality, writing productivity, or curriculum-based
measurement (CBM) writing scoring. Another widely used writing
assessment measures a sentence-level writing ability by asking
students to produce grammatically correct sentences within a
specified time (Writing Fluency task of the Woodcock-Johnson Tests
of Achievement-III [WJ-III]; Woodcock, McGrew, & Mather,
2001). Despite the existence of various ways of assessing students’
writing skill, researchers and practitioners have a limited
understanding of how these various assessments and evaluative
approaches are related and whether they tap into or capture similar or
dissociable dimensions of writing. A clearer understanding of
assessment approaches is needed to advance theories of
development and to guide practitioners in using assessment data to inform
instruction and intervention. In the present study, we addressed
this question with three goals. First, we examined how various
approaches to writing assessments converge or diverge into
different dimensions, using various evaluative approaches such as
writing quality, productivity, and CBM scoring as well as using a
widely used sentence-level task, the WJ-III Writing Fluency task.
Second, we further examined how language and cognitive skills
relate to the identified dimensions. Finally, given the consistent
achievement gaps between boys and girls on national writing
assessments (e.g., Persky, Dane, & Jin, 2003), we also sought to
examine gender differences across the identified dimensions of
writing.


This article was published Online First July 7, 2014.

Young-Suk Kim, College of Education and Florida Center for Reading
Research, Florida State University; Stephanie Al Otaiba, College of
Education, Southern Methodist University; Jeanne Wanzek, College of
Education and Florida Center for Reading Research, Florida State University;
Brandy Gatlin, College of Education, Florida State University.

This research was supported by Grant P50HD052120 from the National
Institute of Child Health and Human Development. The content is solely
the responsibility of the authors and does not necessarily represent the
official views of the National Institute of Child Health and Human
Development. The authors thank study participants including students, teachers,
school personnel, and parents.

Correspondence concerning this article should be addressed to
Young-Suk Kim, Florida Center for Reading Research, Florida State University,
1114 West Call Street, Tallahassee, FL 32306. E-mail: ykim@fcrr.org


Approaches to Writing Evaluation 


According to the simple view of writing (Juel, Griffith, & 
Gough, 1986), two necessary components of writing are ideation 
(i.e., generation and organization of ideas) and transcription skills. 
The first component, ideation, refers to the quality of ideas repre- 
sented in writing, which is an essential, and arguably the most 
important, aspect to be evaluated in written compositions. Not 
surprisingly, writing quality has long and widely been examined in 
previous studies. Two key indicators of writing quality appear to 
be the extent of development and organization of ideas (Bereiter & 
Scardamalia, 1987; Juel et al., 1986). In fact, idea development and 
organization have been widely examined as indicators of writing 
quality in previous studies (Graham, Harris, & Chorzempa, 2002;
Graham, Harris, & Mason, 2005; Kim, Al Otaiba, Folsom, & 
Greulich, 2013; Kim et al., 2011; Olinghouse, 2008). Other widely 
used assessments of writing examine similar aspects. For example, 
the fourth edition of the Test of Written Language (TOWL-4; 
Hammill & Page, 2009) includes theme development and organi- 
zation, and another widely used writing evaluation approach in 
U.S. schools (Gansle et al., 2006), the 6 + 1 Trait Rubric (North- 
west Regional Laboratory, 2011), includes idea development and 
organizational or structural aspects in addition to other aspects 
such as word choice, sentence fluency, voice, presentation, and 
conventions. 

The other component of the simple view of writing, transcrip- 
tion skill, allows generated ideas to be produced in written text and 
facilitates idea generation and development (Berninger et al., 
1997; Graham, Berninger, Abbott, Abbott, & Whittaker, 1997; 
Graham, Harris, & Fink, 2000; Kim et al., 2011). Therefore, the 
amount of written composition is constrained by transcription 
skills to a large extent, particularly for beginning writers. Not 
surprisingly, writing productivity is another widely examined di- 
mension of writing (e.g., Abbott & Berninger, 1993; Berman & 
Verhoeven, 2002; Kim et al., 2011, 2014; Mackie & Dockrell, 
2004; Olinghouse & Graham, 2009; Scott & Windsor, 2000). Note 
that although the term writing fluency has been used often to refer 
to a similar construct, we use the term writing productivity, be- 
cause we are specifically referring to the amount of text produced, 
not the automaticity, effortlessness, or coordination of multiple 
processes, which are defining characteristics of fluency (Berninger 
et al., 2010; LaBerge & Samuels, 1974). In addition, writing 
fluency has been conceptualized to refer to CBM writing (Ritchey 
et al., in press). Although the amount of text alone is not generally 
considered a yardstick or goal of good writing, good written 
composition requires a certain amount of text for the ideas to be 
sufficiently developed and articulated. Previous studies have ex- 
amined writing productivity, and it has been shown to be a disso- 
ciable dimension from writing quality (Kim, Al Otaiba, Sidler, 
Greulich, & Puranik, 2014; Wagner et al., 2011), and correlations 
between writing quality and productivity tend to be fairly strong 
for children in the elementary grades (e.g., .65 ≤ rs ≤ .82; Abbott 
& Berninger, 1993; Kim et al., 2014; Olinghouse & Graham, 
2009). Writing productivity is measured using various indicators 
such as the total number of words, number of ideas, number of 
different words, and/or number of sentences (Kim et al., 2014; 
Kim, Park, & Park, 2013; Puranik, Lombardino, & Altmann, 2008; 
Wagner et al., 2011). 

A third evaluative approach to writing employed in the present 
study is CBM scoring. CBM writing scoring includes some unique 
evaluative tools not included in the writing quality and productiv- 
ity indicators noted previously. Along with reading and math CBM 
measures, CBM writing measures are considered global outcome 
measures, or indicators, of students’ overall writing performance 
(Deno, 1985) that are intended to signal whether the student needs 
further diagnosis and intervention. CBM writing measures were 
initially developed to screen and monitor progress in writing skills 
for students at risk for writing difficulty. Students are typically 
asked to write for 3–5 min in response to prompts (Coker & 
Ritchey, 2010; McMaster, Du, & Pétursdottir, 2009; McMaster et 
al., 2011), and their writing is evaluated using various scoring tools 
such as number of words written, correct word sequences (two 
adjacent words that are grammatically correct and spelled cor- 
rectly), incorrect word sequences, words spelled correctly, percent- 
age of correct word sequences, and correct minus incorrect word 
sequences (see Graham, Harris, & Hebert, 2011; McMaster & 
Espin, 2007, for a review). Note that number of words written is 
not unique to the CBM writing scoring as it has been used as an 
indicator of writing productivity. 

CBM writing measures have been shown to be reliable, and 
students’ scores on CBM writing tend to be related to other writing 
measures with validity coefficients in the moderate range (see 
Graham et al., 2011, and McMaster & Espin, 2007, for reviews; 
Lembke, Deno, & Hall, 2003; McMaster et al., 2009). In particu- 
lar, the correct minus incorrect word sequences (CIWS) score 
tends to be the most strongly related to other writing measures with 
coefficients ranging from .60 to .75 (Espin et al., 2000; Espin, 
Weissenburger, & Benson, 2004). Recently, the percentage of 
correct word sequences (%CWS), along with the CIWS, has also 
been shown to be highly (r = .61) related to a normed writing task 
(Test of Written Language, 3rd ed. 1996) for children in middle 
school (Amato & Watkins, 2011). 

Despite the reliability and validity evidence for CBM writing 
scoring procedures described in these previous studies, it is not 
clear how CBM writing scores should be conceptualized in terms 
of dimensionality. That is, do CBM writing scores capture dimen- 
sions such as writing quality or writing productivity, or do they 
measure a separate, overall global outcome measure of writing? 
Recently, CBM writing measures have been described as writing 
fluency, which is defined as the ease with which an individual 
“produces written text” and includes both “text generation (trans- 
lating ideas into words, sentences, paragraphs, and so on) and 
transcription (translating words, sentences, and higher levels of 
discourse into print).” (italics in the original text; Ritchey et al., in 
press). A critical question is whether potential writing fluency 
indicators capture a dissociable dimension, apart from other widely 
examined dimensions such as writing quality and productivity. 
Although the theoretical foundations for using CBM writing scores 
as measurement are still in their nascent stage, we included CBM 
in the present study because of validity evidence with other writing 
measures, and its potential practical utility for progress-monitoring 
purposes, as CBM indicators have been shown to be sensitive to 
growth over time within a short time period (e.g., 2 weeks; see 
Espin et al., 2004; McMaster & Espin, 2007). 

Finally, although writing skill is typically assessed by asking the 
child to produce a written composition, other tasks also have been 
used. One such widely used standardized subtest is the Writing 
Fluency task of the WJ-III (Woodcock et al., 2001). This task 
assesses sentence-level, rather than paragraph-level, writing. Chil- 
dren are presented with a picture and three words, and they are 
asked to write a sentence about the picture using the three words. 
The child’s score is the number of correct and meaningful written 
sentences based on the three words that were presented. However, 
how the WJ-III Writing Fluency relates to other dimensions of 
writing is an open question. 

In the present study, we examined dimensionality of writing 
using children’s data from written compositions as well as the 
Writing Fluency task of the WJ-III. Children’s written composi- 
tions were evaluated by indicators of writing quality, productivity, 
and CBM writing scores. For the Writing Fluency task of the 
WJ-III, scores following the WJ-III scoring guidelines were used. 
Our goal in the present study was to extend our understanding of 
writing dimensionality. Previous studies have shown that writing 
quality, productivity, spelling and writing conventions, and syn- 
tactic complexity are dissociable dimensions for typically devel- 
oping children in Grades 1 and 4, as well as children with language 
impairments (Kim et al., 2014; Puranik et al., 2008; Wagner et al., 
2011). In the present study, we expand this line of research by 
examining how CBM scores and the Writing Fluency task of the 
WJ-III are related to writing quality and productivity dimensions 
using data from children in Grades 2 and 3. 


Predictors of Writing Skills 


As noted earlier, writing is composed of at least two component 
skills: transcription skills and ideation (Berninger & Swanson, 
1994; Juel et al., 1986). Transcription skills such as spelling and 
handwriting allow mental resources such as attention and working 
memory to be available for idea generation and translation pro- 
cesses (Berninger & Swanson, 1994; Graham, 1990; Graham et al., 
1997, 2000; Scardamalia, Bereiter, & Goleman, 1982). Much 
evidence supports the role of transcription skills in writing 
(Berninger, 1999; Graham et al., 1997; Jones & Christensen, 1999; 
Kim et al., 2011, in press; Wagner et al., 2011). Handwriting skill 
is typically assessed by asking the child to write alphabet letters or 
to copy sentences or paragraphs as accurately and quickly as 
possible within a specified time (e.g., Abbott & Berninger, 1993; 
Graham et al., 1997; Kim et al., 2011; Wagner et al., 2011). 

Although ideation, the other component of writing according to 
the simple view of writing, is challenging to directly measure, it 
has been largely measured by means of oral language use (e.g., 
Chenoweth & Hayes, 2003; Hayes, 2012). Generated ideas cannot 
be produced without being translated into oral language because 
the child has to express ideas using appropriate words, encode 
them using appropriate syntactic structure, and organize and pres- 
ent them in a logical sequence. Therefore, oral language profi- 
ciency would determine how the generated ideas are adequately 
expressed. Evidence of the importance of oral language in written 
composition is accumulating from beginning writers to those in 
middle school (Berninger & Abbott, 2010; Kent, Wanzek, Pet- 
scher, Al Otaiba, & Kim, in press; Kim et al., 2011, 2013, 2014; 
Olinghouse, 2008) as well as children with language impairment 
(Dockrell, Lindsay, & Connelly, 2009; Dockrell & Connelly, in 
press; Kim, Puranik, & Al Otaiba, in press; Puranik, Lombardino, 
& Altmann, 2007). Given that writing is a production or 
constructed-response task, children’s transcription skills constrain 
the extent to which generated ideas can be transcribed into gener- 
ated text (Berninger, Abbott, Abbott, Graham, & Richards, 2002; 
Juel et al., 1986). 

In addition to these previously noted skills, the not-so-simple 
view of writing states that executive function processes such as 
attention, planning, self-regulation, and working memory are crit- 
ical supports for writing development (Berninger & Winn, 2006). 
Attention, in particular, has been shown to be related to writing for 
children in first and second grade (Hooper et al., 2011; Hooper, 
Swartz, Wakely, de Kruif, & Montgomery, 2002; Kent et al., in 
press; Kim, Al Otaiba, et al., 2014). Additional evidence under- 
scoring the importance of attention in writing comes from studies 
with children who have attention deficits or attention-deficit/hyperactivity 
disorder (ADHD); converging evidence suggests that 
students with ADHD made more spelling and grammatical errors 
(Casas, Ferrer, & Fortea, 2013; Gregg, Coleman, Stennett, & 
Davis, 2002; Re, Pedron, & Cornoldi, 2007), made more content 
errors or digressions and demonstrated weaker text structure fea- 
tures than children without ADHD (Casas et al., 2013). 

Individual differences in reading also have been shown to matter 
for children’s writing development (Shanahan, 2006). Studies have 
shown that reading comprehension was related to written compo- 
sition quality and productivity for children in elementary and 
middle school grades (Berninger & Abbott, 2010; Berninger et al., 
2002; Kim, Al Otaiba, et al., 2013, 2014). Children’s reading 
ability might influence written composition skill via reading ex- 
periences. Greater reading ability and consequent text reading 
might allow the opportunity for the child to acquire vocabulary and 
syntactic structures, and organization of written text as well as 
content (Berninger et al., 2006). In fact, children with impaired 
reading comprehension had weaker story content and organization 
in their writing (Cragg & Nation, 2006). 

Writing involves juggling multiple processes to an even greater 
extent than reading does. Therefore, the ability to coordinate multiple 
aspects is likely to be important. Some previous studies have 
examined rapid automatized naming (RAN) in this regard as a 
potential predictor of writing. Numerous studies have shown that 
rapid automatized naming is related to reading (Compton, DeFries, 
& Olson, 2001; de Jong & van der Leij, 2003; Kim, 2011; Kirby, 
Parrila, & Pfeiffer, 2003; Savage et al., 2005; Wolf & Bowers, 
1999; Wolf & O’Brien, 2001). However, despite a robust relation 
to reading in various languages, researchers differ about what it 
exactly measures, and hypotheses include phonological processing 
(Wagner & Torgesen, 1987), automaticity of processes (Bowers, 
1995; LaBerge & Samuels, 1974; Spring & Davis, 1988), global 
processing speed (Kail & Hall, 1994), and multiple constructs such 
as lexical access, automaticity, attentional, visual, and articulatory 
processes (Wolf & Bowers, 1999). If RAN measures automaticity 
of processes, its influence might largely overlap with that of 
automaticity of transcription skills and thus may not be related to 
writing over and above transcription skills. In contrast, if RAN 
captures multiple constructs beyond what is captured by transcrip- 
tion skills, it would be related to writing over and above transcrip- 
tion skills. Although RAN has not been examined for young 
English-speaking children, there is some emerging evidence from 
studies with Chinese children that suggests that RAN is related to 
writing (Chan, Ho, Tsang, Lee, & Chung, 2006; Ding, Richman, 
Yang, & Guo, 2010; but see Yan et al., 2012). 


Gender and Writing 


Gender appears to matter in children’s writing achievement. 
Girls have outperformed boys in writing consistently across grades 
ever since writing was included in the National Assessment of 
Educational Progress (National Center for Education Statistics, 
2011). For instance, in 2002, in which writing was assessed in 
children in Grade 4 as well as those in Grades 8 and 12, girls 
outperformed boys in all three grades, with gaps ranging from 
17 to 25 points (National Center for Education Statistics, 2003). 
Similarly, gender gaps have been reported for children in elemen- 
tary grades (Berninger & Fuller, 1992; Knudson, 1995). Despite 
these consistent gender gaps in writing, our understanding about 
gender gaps in writing and potential sources of gender gaps is 
limited, particularly for children in elementary grades. One poten- 
tial source of gender gaps seen in older students is their attitude 
toward writing. Among adolescents, males tend to have less pos- 
itive attitudes toward writing than do females (Knudson, 1992; 
Pajares & Valiante, 1999), and see less value in writing and 
express less satisfaction with writing activities (Lee, 2013). Stud- 
ies of younger students have reported mixed findings about the 
relation of attitude toward writing and children’s writing skill. 
Knudson (1995) investigated gender and writing attitude with 
children in Grades 2 and 6 and found that children’s attitude 
toward writing predicted their writing skill. In contrast, a study 
with children in Grades 1 and 3 revealed that girls had more 
positive attitudes than boys toward writing as early as in Grade 1, 
but this difference was not related to their writing skill (Graham, 
Berninger, & Fan, 2007). 

Another potential source of gender gaps in writing achievement 
is reading or reading-related skills. As noted earlier, evidence 
suggests that reading is one of the component skills of writing. 
Evidence also indicates that male students have been consistently 
outperformed by female students in reading (e.g., National Center 
for Education Statistics, 2011), and a greater number of boys are 
identified with reading disabilities (Hawke, Olson, Willcutt, Wadsworth, 
& DeFries, 2009; Miles, Haslum, & Wheeler, 1998; Yoshimasu et al., 2010; 
but see Shaywitz, Shaywitz, Fletcher, & Escobar, 1990). Therefore, 
differences in reading or reading-related 
skills might explain differences in writing skills between boys and 
girls. Furthermore, boys in Grades 1, 2, and 3 had lower scores in 
another writing component skill, transcription skill (Berninger & 
Fuller, 1992). In the present study, we examined whether gender 
differences were found for children in Grades 2 and 3 in the 
identified writing dimensions, and if so, to what extent gender 
differences were explained by the included language and cognitive 
skills (e.g., reading, attention, and transcription skills). 


Present Study 


The primary goal of the present study was to examine the 
dimensionality of various writing evaluation approaches, predic- 
tors of various dimensions, and the gender gap in writing. Specific 
research questions were as follows. 

1. What are the relations of CBM writing measures (i.e., CIWS 
and %CWS) and the WJ-III Writing Fluency task to writing 
quality and writing productivity indicators? Do CBM writing 
measures and the WJ-III Writing Fluency task measure dimensions 
dissociable from writing quality and writing productivity? 

2. How are language and cognitive skills related to the identified 
writing dimensions? 

3. Are there performance differences between boys and girls in 
the identified writing dimensions (e.g., writing quality and pro- 
ductivity) after accounting for children’s language and cognitive 
skills? 

To address these questions, we used data from second- and 
third-grade children (N = 494) who were administered multiple 
writing tasks: written compositions in response to three prompts 
(one normed task and two experimental tasks) and a sentence-level 
task, the WJ-III Writing Fluency task. Students’ compositions 
were evaluated using a variety of approaches including writing 
quality indicators such as idea development and organization, 
writing productivity indicators such as number of words written 
and number of ideas, CBM writing scoring such as CIWS and 
%CWS, and scoring protocols in the standardized tasks. Language 
and cognitive skills included oral language, reading, transcription 
(spelling and handwriting fluency), attention, and rapid automa- 
tized naming. 

We hypothesized that writing quality and productivity would be 
dissociable dimensions based on previous studies (Kim, Al Otaiba, 
et al., 2014; Puranik et al., 2008; Wagner et al., 2011). We also 
hypothesized that the CBM writing scores would be a dissociable 
construct because although validity coefficients of CIWS and 
%CWS were acceptable, they are not extremely highly correlated 
with other writing measures (e.g., Amato & Watkins, 2011; 
McMaster & Espin, 2007). In contrast, we did not have an a priori 
prediction about the WJ-III Writing Fluency task. It was also 
hypothesized that various language and cognitive skills would be 
differentially related to different writing outcomes based on a prior 
study (Kim, Al Otaiba, et al., 2014). Finally, gender differences 
were hypothesized, and language and literacy skills were expected 
to explain gender differences to some extent. 


Method 


Participants and Sites 


Students in the present study included 494 children in Grades 2 
(mean age = 7.80 years) and 3 (mean age = 8.82 years). These 
students were drawn from 76 classrooms in 10 schools in a 
midsized city. The students were 51.2% male, and 76.1% received 
free or reduced-price lunch. Six of the 10 schools were Title I 
schools, indicating that the majority of the students in the school 
were eligible for the free or reduced-price lunch program. Stu- 
dents’ racial backgrounds were as follows: 60% African Ameri- 
cans, 29% Whites, and the rest Asians and multiracial children. 
The students and their families had consented for their participa- 
tion, and all guidelines for human research protection continued to 
be followed in the present study. 


Measures 


Writing tasks. Four tasks were used to assess children’s writ- 
ten composition skill: two standardized and normed tasks, and two 
experimental prompts. The first task was the Writing Fluency 
subtest of the Woodcock-Johnson Tests of Achievement (3rd ed., 
or WJ-III; Woodcock et al., 2001). In this subtest, students were 
provided with a series of pictures and three corresponding words 
and were instructed to write a sentence about the picture that 
included the words given. Students were given 7 min to complete 
as many sentences as they could. For the scoring of this subtest, we 
used standard scoring procedures outlined in the testing manual. 
Namely, students received 1 point for each complete sentence. In 
order to receive credit, the sentence had to be clear in meaning and 
include critical words to make the sentence reasonable. Students 
were not penalized for errors in punctuation, spelling, or capital- 
ization, or for poor handwriting. Using the Rasch analysis proce- 
dure, the reliability coefficient was reported to be .72 for 7- and 
8-year-olds (McGrew, Schrank, & Woodcock, 2007). 

We also asked children to write on three prompts: one prompt 
from the Written Essay test of the Wechsler Individual Achieve- 
ment Test (3rd ed., or WIAT-III; Wechsler, 2009) and two exper- 
imental prompts, one narrative prompt and one expository prompt. 
The WIAT-III was selected as a widely used writing assessment 
that could be compared with other research (e.g., see Berninger & 
Abbott, 2010). In the WIAT-III Essay task, children were asked to 
write about a favorite game and include at least three reasons as 
support. Note that standard scores in this task are available starting 
with children in Grade 3 and not available for children in Grade 2. 
Despite lack of standard scores, this task was deemed useful for 
children in Grade 2 for the purpose of examining dimensionality 
and predictive relations. In addition, assessors confirmed that this 
topic did not appear to be difficult for children in Grade 2. 

The experimental narrative prompt was “One day when I got 
home from school . . ..” Children were asked to write about any 
interesting events that occurred responding to the prompt (Kim et 
al., 2013; Kim, Al Otaiba, et al., 2014; McMaster et al., 2009, 
2011). The experimental expository prompt was adapted from a 
previous study (Wagner et al., 2011). In this task, children were 
asked to write about a classroom pet they would like and explain 
why. For each prompt, children were given a 10-min time limit. 


Writing Evaluation 


Children’s written compositions for the WIAT Essay Composi- 
tion task and two experimental prompts were evaluated on writing 
quality, writing productivity, and CBM writing measure scoring 
(discussed later). In addition, the WIAT essay was scored according 
to the examiner’s manual. Children’s responses to the WJ-III 
Writing Fluency task were evaluated only according to the exam- 
iner’s manual previously noted because the responses were sen- 
tences, not passage-level composition. 

Writing quality scoring. The quality of children’s written 
composition was evaluated on the extent to which their ideas were 
developed and the extent to which the ideas were presented in an 
organized manner, on a rating scale of 1 to 7. In the idea development 
aspect, high scores were given to compositions with rich 
and detailed ideas and ideas with unique and interesting perspec- 
tives. In the organization aspect, compositions with logical se- 
quence and transitions of expressed ideas with overall structure of 
beginning, middle, and end received high scores. These were 
similar to the 6-point scale version of the 6 + 1 Trait Rubric but 
were adapted to a 1–7 rating scale, representing low quality and 
high quality, respectively. When using 45 writing samples per 
prompt, reliabilities (Cohen’s kappa) ranged from .82 to .88 for 
ideas and organization. 
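
As a rough illustration of the agreement statistic reported above, the following Python sketch computes an unweighted Cohen's kappa for two raters' 1 to 7 quality ratings. The function name, the example ratings, and the choice of the unweighted form are assumptions made for illustration; the text does not state whether a weighted variant was used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of samples")
    n = len(rater_a)
    # Observed proportion of samples on which the two raters agree exactly.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal category counts.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings for eight compositions (not data from the study).
ratings_a = [1, 2, 3, 3, 2, 1, 4, 4]
ratings_b = [1, 2, 3, 3, 2, 2, 4, 4]
kappa = cohens_kappa(ratings_a, ratings_b)
```

Because kappa corrects for chance agreement, it is lower than the raw proportion of matching ratings whenever raters could agree by chance.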

Writing productivity scoring. Two indicators were used for 
writing productivity: total number of words written and number of 
ideas. The number of words has been widely used as an indicator 
of compositional productivity in writing (e.g., Abbott & Berninger, 
1993; Berman & Verhoeven, 2002; Kim et al., 2011; Mackie & 
Dockrell, 2004; Puranik et al., 2008; Scott & Windsor, 2000; 
Wagner et al., 2011). Words were defined as real words recogniz- 
able in the context of the child’s writing despite some spelling 
errors. Random strings of letters or sequences were not counted as 
words. Random strings of letters were identified by comparing a 
record of what the child said she had written to her written 
composition. These were extremely rare in the sample (less than 
10). The number of ideas was a total number of propositions, 
which were defined as predicate and argument. For example, “I 
went upstairs and took a bath” was counted as two ideas (see, e.g., 
Kim et al., 2011, 2013; Puranik et al., 2008). Repeated ideas were 
only counted once. When using 45 writing samples per prompt, 
reliabilities were .88 for the number of ideas (kappa) and .99 for 
the number of words (similarity). 

Curriculum-based measure scoring. Each essay was indi- 
vidually analyzed for curriculum-based measures (CBM) includ- 
ing the correct word sequence (“any two adjacent, correctly 
spelled words that are acceptable within the context of the sam- 
ple”; McMaster & Espin, 2007, p. 76), and the incorrect word 
sequence (“any two adjacent letters that are incorrect”; McMaster 
& Espin, 2007, p. 76). From these, a correct minus incorrect word 
sequences (CIWS) score was obtained by subtracting the number of incorrect 
word sequences from the number of correct word sequences. The percentage of correct word 
sequences (%CWS) was calculated by dividing the number of 
CWS by the total number of words written. In the data analysis, we 
used CIWS and %CWS for two reasons: (a) number of words 
written has been used as an indicator of writing productivity and, 
thus, is not unique to CBM writing, and correct word sequence is 
highly related to the number of words written (because children 
who write more tend to have a greater number of correct sequences); 
and (b) evidence indicates that CIWS and %CWS have greater 
validity coefficients with other writing tasks than the other CBM 
writing scoring (e.g., McMaster & Espin, 2007). Reliability for 
each type of scoring was established using 45 pieces per prompt. 
We used an equation that produced quotients to indicate the 
proximity of the coder’s score for each measure to that of the 
primary coder (i.e., similarity coefficients; Shrout & Fleiss, 1979), 
and reliability for each measure ranged from .92 to .99. 
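
The two derived CBM scores used in the analysis can be sketched in Python as follows, starting from hand-scored counts for a single composition. This is only an illustration of the arithmetic described above: the function name and the example counts are hypothetical, %CWS follows the definition given in the text (correct word sequences divided by total words written), and expressing it on a 0 to 100 scale is an assumption.

```python
def cbm_scores(cws, iws, total_words):
    """Derive CIWS and %CWS from hand-scored counts for one composition.

    cws: number of correct word sequences
    iws: number of incorrect word sequences
    total_words: total number of words written
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # CIWS: correct minus incorrect word sequences (can be negative).
    ciws = cws - iws
    # %CWS as defined in the text: CWS divided by total words written,
    # scaled to a percentage here (the scaling is an assumption).
    pct_cws = 100.0 * cws / total_words
    return {"CIWS": ciws, "%CWS": pct_cws}


# Hypothetical counts for one essay (not data from the study).
scores = cbm_scores(cws=40, iws=10, total_words=50)
```

CIWS grows with the amount of text produced, whereas %CWS is normalized by length, so the two scores can rank the same pair of compositions differently.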

WIAT standardized scoring. In addition to the previously 
noted evaluative measures, students’ compositions for the WIAT 
Essay Composition task were scored according to the manual. The 
WIAT scoring includes the total number of words, thematic de- 
velopment, and text organization (theme and organization hereaf- 
ter), and a supplemental score called the grammatical score. The 
grammatical score is highly similar to CIWS in CBM writing 
although slight differences are found in operationalization (e.g., 
WIAT does not give credit for titles or endings such as “The End,” 
whereas conventional CBM writing does). The unique scoring in 
the WIAT task, thus, is the theme and organization, and students’ 
compositions were assigned scores in the following categories: 
introduction, conclusion, paragraphs, transitions, reasons why, and 
elaborations. The maximum score possible for the theme and 
organization component was 20 points. Interrater reliability was 
established by having two independent coders score 50 essays and 
comparing individual points assigned. The number of agreements 
was divided by the total number of agreements plus disagreements, 
resulting in a reliability coefficient of .85. A standard score for 
theme and organization was computed for each student based on 
his or her chronological age at the time of testing. The standard 
score for the WIAT Essay Composition task is a composite of the 
standard score for theme and organization and for total number of 
words written. 
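
The interrater agreement index described above (agreements divided by agreements plus disagreements) can be sketched as follows. The function name and the example point values are hypothetical, and treating each assigned point value as one comparison is an assumption about how "individual points" were compared.

```python
def percent_agreement(points_a, points_b):
    """Proportion of point assignments on which two coders agree exactly."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("both coders must score the same non-empty set of items")
    agreements = sum(a == b for a, b in zip(points_a, points_b))
    disagreements = len(points_a) - agreements
    return agreements / (agreements + disagreements)


# Hypothetical category scores from two coders for one essay.
coder_a = [2, 1, 0, 3, 2]
coder_b = [2, 1, 1, 3, 2]
agreement = percent_agreement(coder_a, coder_b)
```

Unlike Cohen's kappa, this index is not corrected for chance agreement.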


Predictors 


Predictors were selected based on our review of the literature 
and included oral language, reading, spelling, handwriting fluency 
(letter writing and story copying tasks), attention, and rapid au- 
tomatized naming. 
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Oral language. Children’s oral language skill was measured 
by the following three tasks: WJ-III Picture Vocabulary (Wood- 
cock et al., 2001), the Test of Narrative Language Narrative 
Comprehension subtest (Gillam & Pearson, 2004), and the Oral 
and Written Language Scales Listening Comprehension subtest 
(Carrow-Woolfolk, 2011). In the Picture Vocabulary task, children 
were asked to identify pictured objects. Test-retest reliability is 
reported as .71–.73 for 7- and 8-year-olds (McGrew et al., 2007). 
The Narrative Comprehension subtest of the Test of Narrative 
Language includes three individual tasks in which each student 
listens to a short story and is then asked to answer specific 
comprehension questions. The internal consistency of this subtest 
is .87, and test-retest reliability is .85 (Gillam & Pearson, 2004). 
In the Listening Comprehension subtest of the Oral and Written 
Language Scales, students listen to a stimulus sentence and are 
asked to point to one of four pictures that corresponds to the 
sentence read aloud by the tester. This subtest’s reported split-half 
internal reliability ranges from .96 to .97 for the age group of our 
sample (Carrow-Woolfolk, 2011). 

Reading. Children’s reading skill was assessed using five 
measures: the WJ-III Letter Word Identification and Passage 
Comprehension subtests (Woodcock et al., 2001), the Sight Word 
Efficiency subtest of the Test of Word Reading Efficiency (2nd 
ed., Torgesen, Wagner, & Rashotte, 2012), the Oral Reading 
Fluency subtest of the WIAT-III (Wechsler, 2009), and the Test of 
Silent Reading Efficiency and Comprehension (Wagner, Torgesen, 
Rashotte, & Pearson, 2010). For the Letter Word Identification 
task, the child is asked to read aloud letters and words of increasing 
difficulty. For the WJ-III Passage Comprehension subtest, stu- 
dents are asked to silently read a short passage and provide a 
missing word that makes sense within the context of the passage. 
Reliabilities (test-retest) are reported to be .96 for both the Letter 
Word Identification and the Passage Comprehension subtests for 
students in the age range of the students we assessed (McGrew et 
al., 2007). In the Sight Word Efficiency task, the child is asked to 
read words of increasing difficulty with accuracy and speed. Test-retest 
reliability for the Sight Word Efficiency is reported to be .93 
for 6- and 7-year-olds and .92 for 8- to 12-year-olds. For the WIAT 
Oral Reading Fluency task, the child is asked to read two grade- 
level passages aloud. The student is timed during both readings, 
and the completion time is recorded in seconds for each prompt. 
Each raw score is then used to compute an average weighted raw 
score to determine oral reading fluency. Test-retest reliability for 
the WIAT Oral Reading Fluency subtest is reported as .93. For the 
Test of Silent Reading Efficiency and Comprehension, the student 
is given 3 min to read a series of statements and determine if each 
statement is true or not. The authors report alternate-form reliabil- 
ity coefficients ranging from .87 to .95 for students in Grades 2 
and 3. 

Spelling. Children’s spelling skill was measured by a dicta- 
tion task, the WJ-III Spelling subtest (Woodcock et al., 2001). 
Once a student misspells six consecutive words, the test is discon- 
tinued. The authors of this assessment report test-retest reliability 
coefficients of .91 and .88 for 7- and 8-year-olds, respectively. 

Letter writing automaticity. The WIAT-III Alphabet Writ- 
ing Fluency task was used, in which children were asked to write 
as many letters of the alphabet as possible with accuracy. This task 
assesses how well children access, retrieve, and write letter forms 
automatically. Research assistants asked children to write as many 
letters of the alphabet as they could in a 30-s time period. Children 
received a score for the number of correctly written letters. One 
point was awarded for each correctly formed letter. Interrater 
reliability (Cohen’s kappa) for this subtest was .88 for our sample. 

Story copying. Another transcription skill, the ability to copy 
letters, was measured by an experimental story copying task. In 
this task, students were instructed to copy a narrative story titled 
“Can Buster Sleep Inside Tonight?” as fast as they could. The story
had 519 words and involved a dog named Buster who was muddy and
was bathed so that he could sleep inside. Students were given 1
min to write as much of the story verbatim as possible. Children
received a score for the number of letters correctly formed, which 
was calculated as the difference between the number of letters 
attempted and the number of letter errors made. Interrater reliabil- 
ity (Cohen’s kappa) for this measure was established at .91. 

Attention. The first nine items of the Strengths and Weaknesses
of ADHD Symptoms and Normal Behavior Scale (e.g., SWAN;
Swanson et al., 2006) were used to measure children’s attentiveness.
SWAN is a behavioral checklist that includes 30 items that are rated
on a 7-point scale ranging from a score of 1 (far below average) to 7
(far above average) to allow for ratings of relative strengths (above
average) as well as weaknesses (below average). The first nine items
are related to sustaining attention on tasks or play activities (e.g.,
“Engages in tasks that require sustained mental effort”) while the
other items assess hyperactivity and aggression. A recent study
showed that the first nine items indeed capture the respondent’s
ability to regulate attention (Sáez, Folsom, Al Otaiba, & Schatschneider,
2012). Higher scores represent greater attentiveness. Teachers
completed the SWAN checklist in the spring. Cronbach’s alpha across 
the nine items was .91. 

Rapid automatized naming. The Letters subtest of the Rapid 
Automatized Naming (RAN) test (Wolf & Denckla, 2005) was 
used. For this subtest, each examinee’s completion time for naming
a series of alternating lowercase letters was recorded. Test-retest
reliability is .89 for children in elementary grades (Wolf &
Denckla, 2005).


Procedures 


All assessments for the current study were conducted during the 
spring of the school year. Research assistants were trained prior to 
each assessment round, which consisted of two individual rounds and 
two small group sessions. Each research assistant spent approximately 
2 hr in training and subsequent practice sessions for each round of 
assessments and was required to pass a fidelity check before admin- 
istering assessments to the participants in order to ensure accuracy in 
administration and scoring. The trained research assistants assessed 
children individually during two sessions; the first session included 
the Test of Word Reading Efficiency, Test of Narrative Language 
Narrative Comprehension subtest, RAN, and WIAT Oral Reading 
Fluency, and the second session included the WJ-III subtests and the 
Oral and Written Language Scales Listening Comprehension test. We 
varied the order of assessments within each session across children in 
order to reduce fatigue effects. Then, all spelling and writing assessments
were administered in small groups over two additional sessions.
Throughout the assessments, students were given breaks as needed. 
Trained research assistants scored students’ letter writing automatic- 
ity, story copying, spelling, and writing, and research assistants were 
trained to use each rubric on a small subset of the sample through
practice and discussion of scoring issues. 


Data Analysis Strategy 


Primary analytic strategies were confirmatory factor analysis 
(CFA) and multilevel modeling. In the latent variable approach (e.g.,
CFA), common variance among multiple indicators is used for a
construct, and thus measurement error is reduced (Bollen, 1989;
Kline, 2005). The first research question, dimensionality of writing,
was examined using CFA. Assumptions (univariate and multivariate 
normality) were checked prior to analysis and were met. Model fits 
were evaluated using the following indices: chi-square, comparative
fit index (CFI), Tucker-Lewis index (TLI), root-mean-square
error of approximation (RMSEA), and standardized root-mean-square
residuals (SRMR). Differences in model fits for two nested models
were evaluated by comparing chi-square differences between the two 


models. Confirmatory factor analysis was conducted using Mplus 7 
(Muthén & Muthén, 2012). Because children were nested within
classrooms and schools, Research Questions 2 and 3 were addressed
using three-level multilevel modeling with the PROC MIXED
procedure of SAS 9.3. Factor scores from CFA models (e.g.,
scores in the identified writing dimensions) were used in the
multilevel modeling.


Results 


Descriptive Statistics and Factor Analysis 


Table 1 shows the means and standard deviations of writing 
scores by grade and gender. Where available, standard scores are 
presented. Note that the WIAT writing composition task was not 
normed for children in second grade, and thus, standard scores are 








Table 1
Means (Standard Deviations) of Writing Measures

                                Grade 3                                         Grade 2
Variable                        Entire sample   Males           Females         Entire sample   Males           Females         Loadings
WIAT total raw score^a          88.82 (35.23)   81.90 (32.90)   97.55 (36.25)   79.89 (36.09)   71.35 (36.35)   87.21 (34.35)   NA
WIAT total score: SS            107.92 (13.74)  105.73 (14.06)  110.68 (12.87)  NA              NA              NA              NA
WIAT theme & organization raw   6.58 (2.84)     6.28 (2.90)     6.97 (2.74)     5.62 (2.59)     5.38 (2.65)     5.83 (2.53)     ey
WIAT theme & organization SS    105.09 (16.02)  103.71 (16.53)  106.83 (15.24)  NA              NA              NA              NA
WJ-III Writing Fluency raw      13.62 (4.41)    12.89 (4.11)    14.55 (4.62)    10.11 (4.97)    9.48 (4.92)     10.68 (4.96)    .67
WJ-III Writing Fluency SS       98.24 (15.78)   96.34 (13.85)   100.63 (17.69)  95.09 (26.54)   93.44 (24.29)   96.57 (28.44)   NA
Writing quality indicators
WIAT idea quality               3.89 (0.88)     3.81 (0.83)     3.98 (0.93)     3.44 (0.76)     3.32 (0.81)     3.55 (0.71)     .66
WIAT organization               3.25 (0.89)     3.21 (0.86)     3.30 (0.94)     2.88 (0.82)     2.79 (0.86)     2.96 (0.79)     .10
Narrative idea quality          4.46 (1.00)     4.30 (0.92)     4.66 (1.06)     4.30 (1.10)     3.99 (1.17)     4.19 (1.04)     .65
Narrative organization          3.56 (0.87)     3.44 (0.76)     3.71 (0.97)     3.16 (0.78)     3.10 (0.84)     3.21 (0.73)     .63
Pet idea quality                3.76 (0.80)     3.66 (0.76)     3.88 (0.83)     3.55 (0.81)     3.39 (0.80)     3.70 (0.80)     .54
Pet organization                2.96 (0.69)     2.92 (0.71)     3.02 (0.67)     2.66 (0.69)     2.53 (0.65)     2.77 (0.69)     .60
CBM scores
WIAT CWS                        63.53 (31.85)   57.40 (30.30)   71.19 (32.27)   52.93 (29.97)   45.40 (28.15)   59.23 (20.11)   NA
WIAT IWS                        26.43 (16.10)   25.47 (14.16)   27.71 (18.22)   29.55 (20.52)   27.94 (22.24)   30.91 (18.93)   NA
Narrative CWS                   66.35 (35.35)   59.94 (31.00)   74.63 (38.89)   54.68 (28.01)   48.50 (25.65)   30.91 (18.93)   NA
Narrative IWS                   32.36 (18.33)   31.63 (17.25)   33.30 (19.68)   37.19 (23.50)   34.82 (25.38)   39.16 (21.73)   NA
Pet CWS                         62.87 (34.33)   54.31 (30.10)   73.22 (36.34)   54.10 (33.52)   45.25 (29.47)   61.69 (35.01)   NA
Pet IWS                         25.57 (17.83)   24.91 (18.32)   26.36 (17.27)   27.40 (21.29)   26.25 (21.29)   28.39 (21.28)   NA
WIAT %CWS                       76 (18)         74 (18)         78 (18)         69 (22)         68 (22)         71 (22)         .87
Narrative %CWS                  73 (17)         72 (17)         74 (18)         66 (20)         66 (20)         66 (20)         .80
Pet %CWS                        78 (20)         75 (21)         81 (19)         72 (22)         69 (22)         74 (22)         .78
WIAT CIWS                       37.09 (35.02)   31.99 (33.18)   43.48 (36.35)   23.46 (34.10)   17.66 (32.66)   28.32 (34.65)   .87
WIAT CIWS SS                    100.35 (17.13)  97.96 (16.55)   103.36 (17.44)  NA              NA              NA              NA
Narrative CIWS                  33.99 (36.83)   28.31 (31.25)   41.33 (42.00)   17.48 (31.69)   13.67 (31.88)   20.64 (31.32)   .85
Pet CIWS                        37.31 (36.71)   29.40 (31.85)   46.86 (39.94)   26.58 (34.84)   18.74 (29.86)   33.30 (37.43)   .19
Writing productivity indicators
WIAT no. of words               82.38 (33.66)   75.83 (31.17)   90.58 (34.97)   74.77 (34.99)   66.36 (34.90)   81.88 (33.59)   .87
WIAT no. of words SS            109.00 (13.49)  106.51 (13.14)  112.15 (13.33)  NA              NA              NA              NA
Narrative no. of words          89.44 (38.38)   82.01 (34.08)   99.03 (41.52)   82.36 (38.12)   73.79 (36.56)   89.47 (38.06)   .84
Pet no. of words                80.05 (36.92)   72.52 (35.24)   89.14 (37.01)   73.86 (41.14)   64.33 (39.14)   82.11 (41.21)   .16
WIAT no. of ideas               12.07 (5.04)    11.28 (4.80)    13.06 (5.18)    11.73 (5.51)    10.44 (5.34)    12.82 (5.43)    .80
Narrative no. of ideas          15.62 (6.69)    14.30 (6.02)    17.31 (7.14)    14.26 (6.80)    12.76 (6.46)    15.50 (6.84)    .18
Pet no. of ideas                12.95 (5.84)    11.74 (5.52)    14.41 (5.91)    11.69 (6.01)    10.47 (5.84)    12.76 (5.98)    .69

Note. WIAT = Wechsler Individual Achievement Test (3rd edition); SS = standard score; WJ-III = Woodcock-Johnson Tests of Achievement (3rd
edition); Narrative = Test of Narrative Language; Pet = pet prompt; CBM = curriculum-based measurement; CWS = correct word sequences; IWS =
incorrect word sequences; CIWS = correct minus incorrect word sequences. Loadings were for the following latent variables: writing quality,
(curriculum-based measurement) CBM writing, and productivity. NA = not applicable. The loadings of the WIAT theme and organization and WJ-III
Writing Fluency raw were those when they were considered as indicators of the writing quality.
^a Words written + theme and organization.
not presented for this grade. In the WIAT writing composition, the
standard score is a composite of the standard scores from the
number of words written and the theme development and organization
standard scores. The mean standard score in the WIAT writing
task was in the average range, albeit toward the high end, for
children in Grade 3 (mean standard score [SS] = 107.92, SD =
13.74). Standard scores in the WJ-III Writing Fluency task were in
the average range as well (mean SS = 98.28 and 95.09 for Grades
3 and 2, respectively). The WIAT Grammar Score, which is CIWS
CBM writing, was in the average range (mean SS = 100.25) for
students in Grade 3. However, note that the standard scores for the
Grammar Score should be viewed with caution due to slight
differences in scoring CIWS between WIAT and our approach
following previous studies (e.g., McMaster et al., 2009).

Table 2 displays descriptive statistics for language and literacy 
predictors by grade and gender. In the language measures, mean 
performance was in the average range—from 8.65 on the Test of 
Narrative Language Narrative Comprehension subtest to 99.13 on 
the WJ-III Picture Vocabulary task. Children’s reading skills, 
spelling, and alphabet writing fluency were also in the average 
range. Correlations are presented in Table 3 for writing variables 
and in Table 4 for language and cognitive variables. Preliminary 
analysis showed that the patterns of relations were highly similar 
for children in Grades 2 and 3, and thus, results from combined 
data are presented. The writing quality variables tended to be 





moderately and statistically significantly related to each other 
while writing productivity variables (number of words and number 
of ideas) were highly related to each other. Given that RAN has not 
been examined in relation to writing in previous studies with 
English-speaking children, correlations of RAN to writing scores 
are presented in Table 3. RAN was weakly to moderately related
to all the writing variables (rs ranged from −.24 to −.43). Language and
cognitive variables in Table 4 were all statistically significantly 
correlated in expected directions. 


Dimensionality of Writing 


In order to examine the dimensionality captured in the various
writing evaluation measures, we conducted a series of analyses.
First, we confirmed the hypothesized factor structure using CFA
models (measurement models) for writing quality and productivity.
Writing quality and productivity were deemed to be a good
place to start because previous studies indicated that they are
dissociable dimensions and their indicators are fairly well understood
(Kim, Al Otaiba, et al., 2014; Kim, Park, & Park, 2013;
Puranik et al., 2008; Wagner et al., 2011). Second, we examined
the measurement model (i.e., CFA) of the CBM writing scores and its
relation to the writing quality and writing productivity dimensions.
Finally, we examined whether the WJ-III Writing Fluency is best








Table 2
Means (Standard Deviations) of Language and Literacy Predictors by Gender

                                    Grade 3                                         Grade 2
Variable                            Entire sample   Males           Females         Entire sample   Males           Females         Loadings
OWLS raw                            87.09 (10.71)   87.29 (10.20)   86.84 (11.37)   82.54 (11.00)   81.15 (11.25)   83.74 (10.69)   .74
OWLS SS                             98.09 (13.48)   98.12 (13.01)   98.05 (14.12)   101.40 (12.47)  99.57 (12.59)   102.98 (12.19)  NA
TNL Narrative Comprehension raw     28.34 (4.60)    27.91 (4.67)    28.87 (4.47)    26.60 (4.89)    25.79 (6.29)    27.29 (4.42)    .10
TNL Narrative Comprehension SS      8.65 (3.06)     8.33 (2.92)     9.04 (3.21)     8.33 (2.70)     7.86 (2.70)     8.73 (3.21)     NA
WJ-III Picture Vocabulary raw       23.24 (3.16)    23221882)       23.28 (2.96)    21.50 (3.19)    21.58 (3.07)    21.43 (3.30)    .75
WJ-III Picture Vocabulary SS        99.13 (10.41)   99.00 (10.89)   99.30 (9.79)    98.74 (10.61)   98.87 (9.92)    98.64 (11.21)   NA
WJ-III LWID raw                     50.26 (6.64)    50.04 (6.74)    50.54 (6.53)    44.37 (7.31)    43.84 (7.47)    44.82 (7.18)    .85
WJ-III LWID SS                      104.84 (11.04)  104.40 (11.21)  105.41 (10.85)  105.89 (11.41)  104.84 (11.34)  106.80 (11.44)  NA
Sight Word Efficiency raw           62.88 (11.59)   61.79 (10.77)   64.20 (12.45)   54.45 (13.19)   53.43 (14.24)   55.31 (12.22)   .89
Sight Word Efficiency SS            96.26 (15.02)   94.50 (13.95)   98.43 (16.04)   99.57 (15.20)   98.05 (16.29)   100.84 (14.16)  NA
WIAT-ORF1                           60.09 (23.48)   61.70 (23.14)   58.04 (23.86)   65.67 (28.27)   67.04 (29.86)   64.50 (26.89)   NA
WIAT-ORF2                           71.69 (28.42)   74.06 (27.75)   68.70 (29.91)   77.83 (41.70)   83.45 (49.13)   73.02 (33.54)   NA
WIAT-ORF weighted raw               105.17 (35.81)  101.75 (34.41)  109.46 (37.21)  88.57 (33.43)   85.10 (33.81)   91.44 (32.96)   .88
WIAT-ORF SS                         103.49 (14.98)  101.85 (14.60)  105.54 (15.25)  99.23 (13.58)   97.37 (13.92)   100.78 (13.14)  NA
TOSREC raw                          25.24 (9.27)    24.21 (9.39)    26.48 (9.02)    26.09 (9.75)    24.80 (9.90)    27.22 (9.52)    .19
TOSREC SS                           101.22 (16.37)  99.36 (16.61)   103.48 (15.85)  98.64 (15.22)   96.64 (15.33)   100.37 (14.97)  NA
WJ-III Passage Comprehension raw    25.77 (3.49)    25.66 (3.52)    25.91 (3.47)    23.44 (3.92)    23.07 (3.95)    23.76 (3.87)    .80
WJ-III Passage Comprehension SS     95.38 (9.48)    95.05 (9.56)    95.82 (9.41)    97.51 (9.45)    96.44 (9.40)    98.45 (9.42)    NA
WIAT Alphabet Writing Fluency raw   17.57 (6.35)    17.32 (6.40)    17.87 (6.31)    15.91 (6.04)    15.76 (6.10)    16.03 (6.00)    NA
WIAT Alphabet Writing Fluency SS    104.74 (18.95)  104.12 (18.33)  105.53 (19.77)  104.62 (17.31)  104.22 (16.85)  104.97 (17.77)  NA
WJ-III Spelling raw                 32.63 (5.88)    32.57 (5.88)    32.70 (5.90)    28.85 (5.51)    28.56 (5.88)    29.09 (5.20)    NA
WJ-III Spelling SS                  102.75 (14.45)  102.50 (14.70)  103.06 (14.20)  102.58 (13.92)  101.46 (14.40)  103.53 (13.50)  NA
Story copying: letters correct      36.21 (15.96)   33.51 (13.99)   39.60 (17.62)   27.55 (11.59)   26.17 (12.35)   28.75 (10.80)   NA
SWAN Attention                      34.36 (10.13)   33.01 (9.73)    36.08 (10.42)   36.68 (11.90)   33.19 (11.77)   39.62 (11.24)   NA
RAN Time                            28.40 (7.15)    28.45 (6.76)    28.35 (7.64)    32.53 (8.88)    33.12 (9.82)    32.03 (8.01)    NA
RAN SS                              99.88 (12.63)   99.37 (11.98)   100.54 (13.43)  99.47 (12.84)   98.57 (13.24)   100.23 (12.49)  NA

Note. Raw = raw score; OWLS = Oral and Written Language Scale; TNL = Test of Narrative Language; SS = standard score; WJ-III =
Woodcock-Johnson Tests of Achievement (3rd edition); LWID = Letter Word Identification; WIAT = Wechsler Individual Achievement Test (3rd
edition); ORF = Oral Reading Fluency; TOSREC = Test of Silent Reading Efficiency and Comprehension; SWAN = Strengths and Weaknesses of ADHD
Symptoms and Normal Behavior Rating Scale; RAN = Rapid Automatized Naming Test. Loadings were for oral language latent variable (OWLS, TNL,
& WJ Picture Vocabulary) and reading latent variable (WJ Letter Word Identification, Test of Word Reading Efficiency Sight Word Efficiency, WIAT
ORF, TOSREC, and WJ Passage Comprehension); NA = not applicable.
Table 4
Correlations Among Language and Cognitive Variables
Variable 1 2 3 4 5 6 7 8 9 10 11 12
1. OWLS 1.00 
2. TNL Narrative Comprehension 51 1.00 
3. WJ-Ill Picture Vocabulary 5 Ol 1.00 
4. WJ-III Letter Word Identification 45 A0 54 1.00 
5. TOWRE Sight Word Efficiency <3 34 A3 .76 1.00 
6. WIAT ORF weighted 40 38 AT eZ 80 1.00 
7. TOSREC 42 43 A8 .66 .69 372 1.00 
8. WJ-III Passage Comprehension 52 46 .63 14 69 67 67 1.00 
9. WIAT Alphabet Writing Fluency 21 25 24 38 39 34 32 31 1.00 
10. WJ-III Spelling 36 .28 42 18 .68 .66 ST 39 3 1.00 
11. Story copying: letters correct 26 25 23 3D 39 40 232 36 43 Al 1.00 
12. SWAN Attention 36 37 2. 43 4 Se 59 Ad .26 AT mae 1.00 . 
13. RAN <6 aed, 18> S48 Ay, 52 — 41 —.43 3 —.46 = 36. ee 


Note. All coefficients are statistically significant at the .001 level. OWLS = Oral and Written Language Scale; TNL = Test of Narrative Language; WJ-III =
Woodcock-Johnson Tests of Achievement (3rd edition); TOWRE = Test of Word Reading Efficiency; WIAT = Wechsler Individual Achievement Test
(3rd edition); ORF = Oral Reading Fluency; TOSREC = Test of Silent Reading Efficiency and Comprehension; SWAN = Strengths and Weaknesses of
ADHD Symptoms and Normal Behavior Rating Scale; RAN = Rapid Automatized Naming Test.


described as an indicator of the writing quality, productivity, CBM 
writing, or as a separate observed variable. 

We hypothesized that the theme and organization score of the 
WIAT composition task would capture the writing quality along 
with the idea development and organization aspects of the adapted 
6 + 1 Trait Rubric because the theme and organization of the 
WIAT task evaluates idea development and structural aspects of 
written composition. CFA confirmed the hypothesis; the model fit
was good: χ²(13) = 72.92, p < .001; CFI = .95; TLI = .92;
RMSEA = .097; and SRMR = .038. Factor loadings are presented
in Table 1. Based on preliminary analysis, error covariance was
allowed between the theme and organization score and the 6 + 1
organization score. The CFA model for writing productivity using
number of words written and number of ideas yielded an excellent
fit: χ²(6) = 34.18, p < .001; CFI = .99; TLI = .98; RMSEA =
.10; and SRMR = .01.

To examine the dimensionality of the variables derived from the 
CBM scoring approaches, we fit two CFA models (two latent 
variables in which the CIWS variable is dissociable from %CWS 
vs. one latent variable in which both CIWS and %CWS capture a 
single latent variable), and we compared model fits. The model fit
for a single dimension was slightly better, Δχ² = 5.87, Δdf = 1,
p = .02. However, the CIWS and %CWS were very highly
correlated when modeled separately (r = .97). Therefore, it appeared
reasonable to model both the CIWS and %CWS as a single
CBM latent variable (noted as CBM writing scoring hereafter) in


Table 5 
Model Fit Indices for Alternative Models 


subsequent analysis. Table 5 shows comparison of CFA model fits 
for alternative models examining whether writing quality, produc- 
tivity, and CBM writing were best considered as three dissociable 
variables or two dissociable variables, or as a single variable. 
Results showed that the three-latent-variable model described the data
best compared with the other alternative models, Δχ²s ≥ 201.21,
ps < .001.

Next, we examined whether the WJ-III Writing Fluency is best 
described as an indicator of the identified dimensions of writing 
(writing quality, productivity, CBM) or is better described as a 
separable variable. When we fit a model in which the WJ-III 
Writing Fluency task was considered as a separate variable from 
the other three (i.e., writing quality, productivity, and CBM writing),
the fit was acceptable: χ²(151) = 1055.14, p < .001; CFI =
.90; TLI = .88; RMSEA = .011; SRMR = .08. The WJ-III
Writing Fluency task correlated most strongly with writing
quality at .67, followed by .59 with CBM writing, and .46 with
productivity. When the WJ-III Writing Fluency task was considered
as an indicator of CBM writing or productivity, the model fits
were statistically significantly worse (ps < .001). When a CFA
model was fit in which the WJ-III Writing Fluency was considered
as an indicator of writing quality, the model fit was not different
from the separate dimension model, that is, the four-factor model,
Δχ² = 5.82, Δdf = 2, p = .054. Therefore, based on these results
and for parsimony, the WJ-III Writing Fluency task is considered
as an indicator of writing quality.
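
The chi-square difference tests reported in this section can be checked by hand. The sketch below assumes the usual logic for nested models, in which the difference in chi-square is referred to a chi-square distribution with degrees of freedom equal to the difference in df. It uses closed-form tail probabilities that hold only for 1 and 2 degrees of freedom, and the function name chi2_sf is illustrative; it is not part of Mplus or SAS.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic (closed forms for df = 1, 2)."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    if df == 1:
        # A chi-square variate with 1 df is the square of a standard normal variate.
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # A chi-square variate with 2 df is exponential with mean 2.
        return math.exp(-x / 2.0)
    raise NotImplementedError("closed form shown only for df = 1 or 2")


# Difference statistics taken from the text of this section.
p_cbm = chi2_sf(5.87, 1)  # CIWS and %CWS: two latent variables vs. one
p_wj = chi2_sf(5.82, 2)   # WJ-III Writing Fluency: quality indicator vs. separate factor

print(f"CBM comparison: p = {p_cbm:.3f}")
print(f"WJ-III comparison: p = {p_wj:.3f}")
```

After rounding, both values agree with the p values of .02 and .054 reported above.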








Model No. and description                                χ² (df)         CFI   TLI   RMSEA  SRMR   Comparison to Model 1: Δχ², Δdf (p)
1. Three latent variables (quality, productivity, CBM)   1061.00 (153)   .90   .88   .11    .083
2. Two latent variables (quality + CBM, productivity)    1262.21 (155)   .88   .85   aie    .098   201.21, 1 (p < .001)
3. Two latent variables (productivity + CBM, quality)    1702.65 (155)   .83   .79   .14    .12    641.65, 1 (p < .001)
4. Two latent variables (quality + productivity, CBM)    1344.67 (155)   .87   .84   .13    .106   283.67, 1 (p < .001)
5. One latent variable (quality + productivity + CBM)    1727.22 (156)   .83   .19   .14    .21    666.22, 2 (p < .001)

Note. CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = root-mean-square error of approximation; SRMR = standardized
root-mean-square residuals; CBM = curriculum-based measurement.
In summary, the CFA revealed the following three dimensions
for the writing outcomes: writing quality, writing productivity,
and CBM writing. The writing quality dimension was strongly
related to CBM writing at .82 and to writing productivity at .75.
Writing productivity and CBM writing were moderately correlated
at .54.


Language and Cognitive Predictors of Writing 
Quality, Writing Productivity, and CBM Writing 


Factor scores of the three writing dimensions (writing quality,
productivity, and CBM writing) from CFA results were extracted
from Mplus (SDs = 1.83, 25.74, and 28.74, for writing quality,
productivity, and CBM, respectively; means are 0), and these three
dimensions were used in subsequent multilevel modeling with
SAS 9.3. In addition, latent variables were created for the predictors
with multiple measures (i.e., oral language and reading) using
CFA. Factor loadings were high (see Table 2), and model fits were
excellent (not shown). Then, factor scores of these language and
reading latent variables were used in the multilevel models.

First, unconditional models without any predictors were fit for
the three writing outcomes to partition the variance attributable
to individuals, classrooms, and schools. Intraclass correlations
were as follows: (a) writing quality, .16 at the school level
and .05 at the classroom level; (b) writing productivity, .07 at the
school level, but 0 at the classroom level; and (c) CBM writing, .16
at the school level and .15 at the classroom level. In other words,
approximately 16% of the total variance in writing quality, 7% in
writing productivity, and 16% in CBM writing was due to school
differences, whereas approximately 5% of the total variance in
writing quality, 0% in writing productivity, and 15% in CBM writing was due to


Table 6 


differences among classrooms. In the subsequent analysis, a three-level
model (school, classroom, and individual) was fit for
the writing quality and CBM outcomes, whereas a two-level model
(school and individual) was constructed for the writing productivity
outcome because of the lack of variance at the classroom level in
writing productivity.
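
The variance partition behind these intraclass correlations can be written out explicitly. The following is a sketch that assumes the standard unconditional three-level random-intercept model, with child i nested in classroom j nested in school k; the notation is illustrative and is not taken from the original analysis.

```latex
\begin{aligned}
Y_{ijk} &= \gamma_{000} + v_{k} + u_{jk} + e_{ijk},
\qquad v_{k} \sim N(0, \sigma^{2}_{v}),\quad
u_{jk} \sim N(0, \sigma^{2}_{u}),\quad
e_{ijk} \sim N(0, \sigma^{2}_{e}), \\[4pt]
\mathrm{ICC}_{\text{school}} &= \frac{\sigma^{2}_{v}}{\sigma^{2}_{v} + \sigma^{2}_{u} + \sigma^{2}_{e}},
\qquad
\mathrm{ICC}_{\text{classroom}} = \frac{\sigma^{2}_{u}}{\sigma^{2}_{v} + \sigma^{2}_{u} + \sigma^{2}_{e}}.
\end{aligned}
```

Under this formulation, the reported values of .16 and .05 for writing quality leave a share of about 1 − .16 − .05 = .79 of the total variance among children within classrooms.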

We then fit models (M1) to examine unique correlates of writing
quality, writing productivity, and CBM writing (Research Question 2).
As shown in Tables 6 and 7, for writing quality, all the
language and cognitive predictors were statistically significant
after accounting for children’s age: children’s oral language (p =
.004), reading (p < .001), spelling (p < .001), letter writing
automaticity (p = .048), story copying (p < .001), RAN (p =
.005), and attention (p = .03). After accounting for all these
variables, no variance remained at the classroom and school levels.
For writing productivity, individual differences in reading (p =
.002) and timed tasks such as letter writing automaticity (p =
.004), story copying (p < .001), and RAN (p < .001) were related,
whereas oral language, spelling, and attention were not (ps ≥ .24).
Finally, for the CBM writing scoring outcome, children’s reading
(p < .001), spelling (p < .001), story copying (p < .001), and
attention (p = .02) remained statistically significant, whereas oral
language, letter writing automaticity, and rapid automatized naming
did not (ps ≥ .43).


Gender and Writing 


To address the third research question, on the gender gap, we first
included children’s gender as the main predictor in addition to 
the age control variable for each writing outcome. This allowed 
us to see whether gender differences were found after account- 


Results of Multilevel Models: Writing Quality and Writing Productivity Predicted by Students’ Language and Literacy Skills,
Attention, and Gender

                      Writing quality            Writing productivity
Variable              M1      M2      M3         M1      M2      M3
Fixed effects 

Intercept —2,35 (0.83)""* 0.34 (1.14) —2,24 (0.81) 19.42 (16.28) —3.11 (15.40) 21.71 (15.88) 
_ Age in months —0.04 (0.08) —0,02 (0.14) —0.02 (0.08) —2,52 (1.59) 0.96 (1.86) —2.06 (1.55) 
“Male NA ~0,71 (0.15)*" —0.41 (0.10)°* NA ~11.80 (2.22)"* —8.70 (1.93)°* 
_ Reading 0.10 (0.02)"* 0.10 (0.02)*** 0.95 (0.002)* 0.91 (0.30)* 
Oral language 0.03 (0.01)** 0.03 (0.009) —0.15 (0.18) —0.11 (0.17) 
WI-III spelling 0.05 (0.01)°* 0.06 (0.01)*** 0.52 (0.18) —0.13 (0.24) 
WIAT letter writing 0.02 (0.01)* 0.02 (0.009) 0.52 (0.18)”* 0.55 (0.17)* 
Story copying 0.03 (0.004)°** 0.03 (0.004)°** 0.54 (0.08)"** 0.51 (0.08)"* 
SWAN attention 0.01 (0.005) 0.008 (0.006) 0.13 (0.11) —0.03 (0.11) 
RAN —0.02 (0.008)* —0.02 (0.008)** —0.73 (0.15)"** —0.75 (0.15)°* 
Variance components
School 0 0.57 0 17.42 43.16 15.65
Classroom 0 0.25 0 NA NA NA
Children 1.07 2.42 1.03 366.44 580.09 349.77
−2LL 1206.2 1929.9 1190.8 3641.8 4568.1 3621.8
AIC 1226.2 1941.7 1212.8 3663.8 4578.1 3645.8

Note. Standard deviations are in parentheses. M1 = Model 1 (examines the relation of language and cognitive skills); M2 = Model 2 (examines the
relation of gender and writing); M3 = Model 3 (examines the relation of gender and writing after accounting for language and cognitive skills); NA =
not applicable; WJ-III = Woodcock-Johnson Tests of Achievement (3rd edition); WIAT = Wechsler Individual Achievement Test; SWAN = Strengths
and Weaknesses of ADHD Symptoms and Normal Behavior Rating Scale; RAN = Rapid Automatized Naming Test; −2LL = log-likelihood; AIC =
Akaike information criterion.
*p < .05. **p < .01. ***p < .001.
Table 7
Results of Multilevel Models: Curriculum-Based Measurement (CBM) Writing Scoring Predicted
by Students’ Language and Literacy Skills, Attention, and Gender

CBM writing
M1 M2 M3

Fixed effects


Intercept —68.45 (13.91)"™ 14.35 (17.89) — 66.39 (13.74) 
Age in months —0.67 (1.33) —1.40 (2.15) =0:35 (1:32) 
Male NA — 10.67 (2.36)"™* —6.35 (1.70)"™* 
Reading 1.48 (0.27)"™* 1.45 (0.26)""* 
Oral language 0.12 (0.16) 0.16 (0.15) 
W/J- III spelling —0.04 (0.16) 1.88 (0.21)*** 
WIAT letter writing 1.81 (0.22)"** —0.02 (0.15) 
Story copying 0.26 (0.07)*** 0.24 (0.07)"*" 
SWAN attention 0.30 (0.09)"** 0.23 (0.10)* 
RAN —0.03 (0.13) —0.04 (0.13) 
Variance components
School Sal 150.82 6.09
Classroom 0 129.65 0
Children 284.10 528.60 274.05
−2LL 3528.5 4648.2 3514.8
AIC 3550.5 4660.2 3539.6


Note. M1 = Model 1 (examines the relation of language and cognitive skills); M2 = Model 2 (examines the
relation of gender and writing); M3 = Model 3 (examines the relation of gender and writing after accounting
for language and cognitive skills); NA = not applicable; WJ-III = Woodcock-Johnson Tests of Achievement
(3rd edition); WIAT = Wechsler Individual Achievement Test; RAN = Rapid Automatized Naming Test;
−2LL = log-likelihood; AIC = Akaike information criterion.
*p < .05. ***p < .001.


ing for age, and if so, how large the gaps were before including 
any potential explanatory variables. As shown in the second 
models in Tables 6 and 7, in all the writing outcomes, boys had 
statistically significantly lower scores after accounting for age. 
In writing quality, boys scored, on average, 0.39 standard
deviations lower than girls. In writing productivity, boys scored,
on average, 0.46 standard deviations lower than girls,
and in CBM writing, boys scored 0.37 standard deviations
lower than girls.

Language and cognitive variables were then included in the 
models to investigate whether gender differences in writing score 
persisted or disappeared after controlling for these language and 
cognitive variables. Results in Tables 6 and 7 (M3) show that boys 
continued to have lower mean scores in writing even after account- 
ing for all the included language and cognitive variables. However, 
the effect sizes were reduced by approximately a quarter to a third
compared with those in the initial models: the effect sizes were .22 
in writing quality, .34 in writing productivity, and .22 in CBM 
writing. In other words, the included language and literacy predic- 
tors explained the gender gap in writing outcomes to some extent, 
but the relation between gender and writing was not completely 
mediated by the included language and cognitive skills. It is of 
note that the relation of language and cognitive skills to the three 
writing outcomes essentially remained the same between M1 
(before controlling for gender) and M3 (after accounting for 
gender). However, an exception was found for attention, which 
was no longer related to writing quality once gender was taken 
into consideration in addition to language and cognitive skills 
and age. 
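
The effect sizes in this section appear to follow from dividing each unstandardized Male coefficient in Tables 6 and 7 by the standard deviation of the corresponding factor score. The following sketch is a reconstruction under that assumption, not a procedure stated in the text; the coefficients are read from the M2 and M3 columns of the tables, and the names FACTOR_SD, MALE_COEF, and gap_table are illustrative.

```python
# Factor-score standard deviations reported in the text (means are 0).
FACTOR_SD = {"quality": 1.83, "productivity": 25.74, "cbm": 28.74}

# Unstandardized "Male" coefficients read from Tables 6 and 7.
MALE_COEF = {
    "M2": {"quality": -0.71, "productivity": -11.80, "cbm": -10.67},  # age only
    "M3": {"quality": -0.41, "productivity": -8.70, "cbm": -6.35},    # plus predictors
}


def standardized_gap(coef: float, sd: float) -> float:
    """Gender gap in SD units, ASSUMING effect size = |coefficient| / factor-score SD."""
    return abs(coef) / sd


def gap_table() -> dict:
    """Per outcome: (gap in M2, gap in M3, proportional reduction from M2 to M3)."""
    rows = {}
    for outcome, sd in FACTOR_SD.items():
        before = standardized_gap(MALE_COEF["M2"][outcome], sd)
        after = standardized_gap(MALE_COEF["M3"][outcome], sd)
        rows[outcome] = (round(before, 2), round(after, 2), round(1 - after / before, 2))
    return rows


for outcome, (before, after, reduction) in gap_table().items():
    print(f"{outcome}: M2 gap = {before:.2f} SD, M3 gap = {after:.2f} SD, reduction = {reduction:.2f}")
```

Under this assumption the first two columns reproduce the reported gaps of 0.39, 0.46, and 0.37 before the predictors are added and .22, .34, and .22 after.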


Discussion 


In the present study, we investigated the dimensionality of 
writing, predictors of writing, and gender differences, using a large 
data set from second and third grade students in the United States. 
Findings showed that writing quality, writing productivity, and 
CBM writing (CIWS and %CWS) were dissociable dimensions, at 
least for children in Grades 2 and 3. Furthermore, unique predic- 
tors of each dimension differed. 

In conjunction with previous studies (Kim, Al Otaiba, et al., 
2014; Puranik et al., 2008; Wagner et al., 2011), the present 
findings suggest that writing is not a single dimension but is 
composed of multiple dimensions. Theoretically, the writing qual- 
ity and productivity dimensions describe skills that are hypothe- 
sized to be products of two key components in writing, namely, 
ideation and transcription skills (Juel et al., 1986). Idea develop- 
ment and organization aspects, theme and organization scores in 
WIAT, and the WJ-III Writing Fluency task all captured the 
writing quality dimension, whereas number of words written and 
number of ideas captured the writing productivity dimension. 
These findings confirm previous studies about the dissociability of 
writing quality and productivity (Kim, Al Otaiba, et al., 2014; see 
also Puranik et al., 2008, and Wagner et al., 2011), but extend our 
understanding by demonstrating that the theme and organization 
score of WIAT and the sentence-level WJ-III Writing Fluency
task capture writing quality. It is interesting that the WJ-III
Writing Fluency task was more strongly related to writing quality 
than to writing productivity or CBM writing and was best de- 
scribed as an indicator of writing quality. This result suggests that 
the accuracy and rate at which children can construct sentences is 
likely to be an indicator of writing quality but not writing produc-
tivity or CBM writing, at least at this stage of writing development. 
It might be that the WJ-III Writing Fluency task captures effi- 
ciency of children’s transcription skills and sentence production 
skills (an oral language skill), both of which are important for 
written composition. It is plausible that this efficiency enables 
children to focus on higher order processes such as idea expression 
and organization. 

Although CBM writing measures and coding methods have 
been examined for reliability and validity (see Graham et al., 2011; 
McMaster & Espin, 2007, for a review), the nature of their theo- 
retical construct and dimensionality has been nebulous. In the 
present study, CBM writing scores captured a dimension dissociable
from writing quality and productivity and were strongly associated
with the quality of writing at .82 and moderately associated
with writing productivity at .54. It should be noted that in the 
present study, we included two scoring tools that are unique to 
CBM, CIWS and %CWS, although CBM writing scores also 
include other indicators such as total number of words written. 
This latter variable was conceptualized as a productivity indicator 
in the present study according to previous studies and findings 
(e.g., Abbott & Berninger, 1993; Graham et al., 1997; Kim, Al 
Otaiba, et al., 2014; Wagner et al., 2011). Whether the separate 
CBM writing dimension in the present study should be conceptu- 
alized as a global outcome measure of children’s writing skill or 
writing fluency, or as another construct is beyond the scope of the 
present study. As noted earlier, CBM writing was recently theo- 
rized as writing fluency, which was defined as the ease to generate 
written text. According to automaticity and information processing 
theories (e.g., LaBerge & Samuels, 1974; Posner & Snyder, 1975), 
fluency (or automaticity) is required so that cognitive resources 
such as attention and working memory can be used for higher 
order cognitive processes. Applying this to writing development,
efficiency in generating ideas and transcribing those ideas into 
written texts would allow a writer to focus on aspects such as 
presenting ideas in an organized, clear, and rich manner to enhance 
writing quality. The two CBM writing variables used in the present 
study (CIWS and %CWS) appear to operationalize writing fluency 
well because both capture not just the amount of writing but 
efficiency (accuracy and amount). In addition, CIWS and %CWS 
tend to have highest validity evidence (e.g., Amato & Watkins, 
2011; McMaster & Espin, 2007). One way to validate CBM 
writing measures (at least CIWS and %CWS) as indicators of 
writing fluency is to examine how data fit this theoretical hypoth- 
esis. Specifically, Ritchey et al. (in press) hypothesized that writ- 
ing fluency includes text generation and transcription, which is 
aligned well with the simple view of writing (Juel et al., 1986) and 
the not-so-simple view of writing (Berninger & Winn, 2006). 
Therefore, text generation and transcription skills are component 
skills of writing fluency (i.e., CBM writing), which then would 
predict the criterion measure of writing such as writing quality. In 
other words, the CBM writing measures should mediate, at least 
partially, the relations of the text generation and transcription to 
the criterion measure of writing. Effort is under way to investigate 
this hypothesis by the current research team. 
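
The mediation hypothesis described above can be illustrated with a minimal sketch. It assumes a single-mediator model in which a component skill (here labeled transcription) predicts CBM writing, which in turn predicts writing quality. The data are synthetic, and the variable names, coefficients, and sample size are hypothetical; this is not the analysis or the data of the present study. The sketch only shows how the total effect splits into a direct effect and an indirect effect (the product of the two paths through the mediator).

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def mediation_paths(x, m, y):
    """Return OLS estimates (a, b, c_prime, c) for a single-mediator model.

    a       : slope of the mediator m regressed on x
    b       : slope of m when y is regressed on x and m
    c_prime : slope of x when y is regressed on x and m (direct effect)
    c       : slope of x when y is regressed on x alone (total effect)
    """
    vx, vm = cov(x, x), cov(m, m)
    cxm, cxy, cmy = cov(x, m), cov(x, y), cov(m, y)
    a = cxm / vx
    c = cxy / vx
    det = vx * vm - cxm ** 2
    b = (cmy * vx - cxy * cxm) / det
    c_prime = (cxy * vm - cmy * cxm) / det
    return a, b, c_prime, c


# Synthetic, hypothetical data (not the study's data).
rng = random.Random(42)
transcription = [rng.gauss(0, 1) for _ in range(300)]
cbm_writing = [0.6 * t + rng.gauss(0, 1) for t in transcription]
writing_quality = [
    0.5 * m + 0.2 * t + rng.gauss(0, 1)
    for t, m in zip(transcription, cbm_writing)
]

a, b, c_prime, c = mediation_paths(transcription, cbm_writing, writing_quality)
indirect = a * b  # effect of transcription carried through CBM writing

print(f"total effect (c):         {c:.3f}")
print(f"direct effect (c'):       {c_prime:.3f}")
print(f"indirect effect (a * b):  {indirect:.3f}")
```

Partial mediation would correspond to a nonzero indirect effect together with a direct effect that remains above zero. A full test of the hypothesis would also need standard errors (e.g., bootstrapped) and latent variables, which this sketch omits.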

Another piece of evidence about multiplicity of writing dimen- 
sions comes from differential relations of language and cognitive 
skills to the three dimensions. In the model after accounting for 
gender (Model 3), whereas reading, letter writing fluency, and
rapid automatized naming were related to both writing quality and 
productivity, oral language and spelling were related only to writ- 
ing quality, but not to writing productivity. In addition, attention 
was related to the CBM writing outcome over and above the other 
variables in the model. Interestingly, although CIWS and %CWS 
do take into consideration grammatical accuracy, oral language 
skill did not uniquely influence the CBM writing. It is notable that 
reading was a consistent predictor for all three dimensions, under- 
scoring the importance of early reading skill in early writing, even 
after accounting for other variables in the model. These results add 
to the increasing evidence of the relation between reading and 
writing, particularly in the elementary years (Berninger et al., 
2002; Kim, Al Otaiba, et al., 2013, in press; Shanahan, 2006; 
Shanahan & Lomax, 1986). Reading has been hypothesized to play 
a role during the process of self-monitoring during planning and 
revision as children have to assess and plan for revisions (Hayes, 
1996; McCutchen, Francis, & Kerr, 1997). Additionally, reading
skills might contribute to the quality of writing by way of reading 
experiences—better readers read more, and greater amount of 
reading might help children with idea generation from increased 
background knowledge and better organization of ideas (Berninger 
et al., 2006). 

Transcription skills also tended to be consistently related to the 
writing outcomes. Spelling skill was related to writing quality and 
CBM writing, and letter writing automaticity was related to writing 
quality and writing productivity. These findings confirm previous 
studies about the role of transcription skills in writing, as they are 
needed not only to encode ideas into written language but also to 
allow cognitive resources to be used for higher order writing 
processes (Abbott & Berninger, 1993; Berninger et al., 1997; 
Graham et al., 1997). It is noteworthy that compared with the letter 
writing task, the story copying task was related to all three of the 
writing outcomes after accounting for the other variables in the 
model, suggesting that story copying captures processes beyond 
those captured by the alphabet letter writing task. Story copying 
may involve a greater extent of processing capacity (e.g., working 
memory) to hold and process words and sentences as it is a 
discourse-level text, whereas a letter writing task is simply retrieval
of letters from memory. Future studies are needed to replicate
the results and to examine potential sources of differences between
letter writing and story copying tasks.

Attention was another cognitive skill that was hypothesized to 
be important for writing (Berninger & Winn, 2006), and it was 
related to writing quality and CBM writing in the present study, 
confirming previous findings for children in first grade and in 
kindergarten (Kent et al., in press; Kim, Al Otaiba, et al., 2013). 
Interestingly, once children’s gender was accounted for, attentive- 
ness was not related to writing quality although its relation re- 
mained for the CBM writing outcome. These results suggest that 
gender may mediate the relation between attention and writing 
quality. Previous studies did not include gender as a covariate in 
examining the role of attention in writing. Future studies are 
needed to investigate the precise role of attention in writing de- 
velopment including reasons why attention matters for CBM writ- 
ing. This is important for typically developing students but also for 
students with ADHD, as boys are more commonly diagnosed than 
girls (Arcia & Conners, 1998; Levy, Hay, Bennett, & McStephen, 
2005). 

RAN was weakly to moderately related to various writing scores
in bivariate correlations. Once other language and literacy skills 
were accounted for, RAN was independently related to writing 
quality and productivity, and to our knowledge, this was the first 
study to examine this relation in English. On the one hand, our 
findings converge with two previous studies in another orthogra- 
phy, Chinese (Chan et al., 2006; Ding et al., 2010). On the other 
hand, however, they are discrepant from those of a third study with 
Chinese children in which RAN was not related to writing once 
transcription skills were accounted for (Yan et al., 2012). If RAN 
captures mostly automaticity of letter retrieval, then its influence 
should be largely shared with handwriting automaticity tasks such 
as letter writing automaticity and story copying. Given its relations 
to writing quality and productivity, it appears that RAN captures 
processes beyond handwriting fluency. According to the multi- 
component account of RAN (Wolf & Bowers, 1999; Wolf & 
Denckla, 2005), RAN includes processes for visual, orthographic,
and verbal processing, and this integration process might be a 
factor that drives the independent relation of RAN to writing 
quality and productivity over and above the other language and 
literacy skills. 

These results of multiple dimensions and associated predictors 
offer important implications for instruction and assessment prac- 
tices. Instructionally, teachers may target different aspects and 
skills to ensure student progress on all areas of writing. In addition, 
if data suggest that a child has weaknesses that may impact a 
particular writing dimension, teachers may target skills in that area 
of interest. For instance, if the teacher is mostly interested in 
improving children’s writing quality, the teacher may want to 
model and introduce strategies to help students focus on the 
development and organization of ideas and expressing generated 
ideas with appropriate language. It is also worth noting that, to
improve writing quality, instructional attention is needed in mul- 
tiple aspects such as oral language, reading, transcription skills, 
RAN, and attention, given that the quality of writing was predicted 
by a wide array of language, literacy, and cognitive assessments.
If the teacher is particularly concerned about the children’s
productivity, the teacher may focus more on transcription-related 
skills, given their roles in writing productivity. The teacher could 
target spelling, sentence writing fluency, or other related transcrip- 
tion skills. 

Furthermore, if the teacher’s primary goal is progress monitor- 
ing in writing, the CBM writing scores appear most appropriate for 
two reasons. First, although CBM and writing quality appear to be 
separable dimensions, CBM writing scores give a general idea 
about writing quality, given a strong relation between writing 
quality and CBM writing scores (r = .82). Second, CBM writing 
scores have been shown to be reliable and sensitive to growth 
captured within a relatively short span of time (e.g., 2 weeks), which is
important when assessments are frequent (Graham et al., 2011;
Lembke et al., 2003; McMaster & Espin, 2007; McMaster et al., 
2009, 2011). In contrast, writing quality may be less appropriate 
for frequent assessments because writing quality indicators, which 
are typically evaluated on a rating scale, are not likely to be as 
sensitive as CBM writing measures in capturing changes during a 
short period. This speculation, however, requires a future study. 

Finally, confirming previous studies (Berninger & Fuller, 1992; 
Knudson, 1995; National Center for Education Statistics, 2011), 
boys in the present study performed more poorly on all three
writing dimensions, with effect sizes ranging from .37 to .46. 
Results further showed that gender differences were partly explained by
the included language and cognitive skills, as the effect sizes for
gender differences were reduced by approximately a quarter to a
third when these variables were taken into account. In other words,
the language and cognitive variables included in the present study 
partially explained writing performance differences between boys 
and girls. On the other hand, these results indicate that gender 
differences persisted in all three writing outcomes even after
accounting for these language and cognitive skills. These findings 
indicate that studies are needed to expand the understanding of 
potential causes of gender gaps in writing. Given findings of a 
previous study that even in Grade 1, boys engage in less writing 
(Graham et al., 2007), it would be informative to investigate how 
attitude toward writing together with language and literacy vari- 
ables explains gender gaps in writing, and whether attitude is 
malleable. Additionally, other potential sources of gender gaps 
(e.g., persistence in writing; McKenna, Kear, & Ellsworth, 1995) 
need to be investigated in future studies. 


Limitations and Conclusion 


One of the limitations of the present study is that many children 
in the present study came from low-income family backgrounds 
from one midsized city in the Southeast. In addition, the children
were primarily African Americans and Whites, with virtually no 
English-language learners. Although their writing performance 
was in the average range in standardized and normed writing 
assessments, future research is needed to determine whether sim- 
ilar results are found for children from different socioeconomic 
and linguistic backgrounds. Further understanding is also required 
regarding the CBM writing scoring dimension. Many studies have 
shown technical adequacy and the utility of CBM in screening and 
progress monitoring of elementary grade children’s writing. Re- 
cent efforts in theoretical conceptualization (e.g., McMaster & 
Espin, 2007; Ritchey et al., in press) are in the right direction to 
help the field gain better understanding of this dimension of 
writing. Finally, there are other types of evaluative approaches to 
written compositions and predictors of writing skills that were not 
included in the present study. For instance, text elements (e.g., 
presence of text structural elements such as topic sentence and 
supporting details; Kulikowich, Mason & Brown, 2008; Wagner et 
al., 2011) and spelling and writing conventions (e.g., punctuation 
and handwriting) were not examined in the present study. In 
addition, motivational, discourse knowledge, and cognitive factors 
(e.g., strategic writing) have been shown to be related to writing 
skills (e.g., Bruning & Horn, 2000; Graham et al., 2005; Hidi & 
Boscolo, 2006; Limpo & Alves, 2013; Pajares, 2003; Olinghouse 
& Graham, 2009) but were not examined in the present study. 

Overall, the findings of the present study suggest that writing 
quality, writing productivity, and CBM writing (composed of 
CIWS and %CWS) are separate dimensions for children in Grades 
2 and 3 and that the relations of language and literacy variables 
differed for various writing outcomes. In addition, gender differ- 
ences persist even after accounting for language and cognitive 
skills. Future research is needed to replicate the present study and 
to further expand researchers’ understanding about skills that 
influence children’s writing development. 


Both word reading accuracy and word reading fluency are
critical for successful comprehension (Ehri, 1997; Gough & Tun- 
mer, 1986; Hoover & Gough, 1990; LaBerge & Samuels, 1974; 
Perfetti & Hogaboam, 1975; Stanovich, 1991). Accurate word 
recognition leads to correct lexical activation (Perfetti, 1985; 
Stanovich, 1991), whereas fluent word recognition is critical for 
rapid processing of orthographic information. Together, accurate 
and efficient lexical access allows for greater capacity for higher 
level processing such as constructing meaning at the sentence, 
paragraph, and discourse levels of a text (Breznitz, 2006; Perfetti, 
1985; Swanson & Berninger, 1995; Stanovich, 1994). 

Research has demonstrated that for bilingual children, many 
skills developed in their first language (L1) are positively related 
to reading acquisition in their second language (L2; e.g., 
Durgunoglu, 2002; Genesee, Geva, Dressler, & Kamil, 2006; 
Gottardo, 2002). The present study focused on cross-language
transfer of word reading accuracy and word reading fluency in 
bilingual children. The transfer patterns of these two constructs, 
especially those of word reading fluency, have not been system- 
atically examined. Notably, there is little agreement in the litera- 
ture as to what constitutes transfer (Koda, 2007). Traditionally, 
transfer is defined as the use of linguistic (and cognitive) knowl- 
edge acquired in L1 for L2 learning (Odlin, 1989). This type of 
transfer hinges upon the linguistic distance between L1 and L2 
(Koda, 2007). When the L1 and L2 are closely related, shared 
structural properties pose similar demands on processing and allow 
L1 competencies to function in L2 reading with little adjustment. 
By contrast, L1 skills do not facilitate L2 reading to the same 
extent when the two languages are distantly related. Recently, 
transfer has also been conceptualized as the ability to develop L2 
proficiency by drawing on previously acquired resources (Genesee 
et al., 2006; Koda, 2007). Although these resources, such as 
phonological awareness and verbal working memory, are first 
acquired in bilingual children’s L1, they are important for devel- 
oping reading skills in any language and are thereby considered 
script universal (Abu-Rabia & Siegel, 2002; Da Fontoura & Sie- 
gel, 1995; Genesee & Geva, 2006). 

Our study presents a unique opportunity to distinguish between 
the two types of cross-language transfer. Unlike most previous 
transfer studies that examined bilinguals only from a single lan- 
guage background, this study simultaneously involved two groups 
of bilinguals, Spanish-English bilinguals and Chinese-English
bilinguals. We compared cross-language transfer of two aspects of 
reading at the word level: word reading accuracy and word reading 
fluency. The L1s of the two groups, Spanish and Chinese, have 
many contrasting features in terms of how the oral language is 
represented by the script, as well as in terms of how much overlap 
they have with English, the children’s L2. Specifically, Spanish and
English share the same alphabetic script and the use of grapheme- 
to-phoneme correspondences in reading, while Chinese is a logo- 
graphic script that emphasizes the mapping between characters and 
morphemes. Thus, comparing transfer patterns in these two bilin- 
gual groups can elucidate whether and how L1 characteristics 
influence patterns of transfer and reveal script-universal and script- 
specific processes in cross-language transfer. 


Word Reading Accuracy 


Word reading accuracy refers to the ability to accurately identify 
single words from print. Critically important for reading compre- 
hension, it is a skill that all children must master in the early 
grades. Word reading accuracy is typically assessed by asking 
children to read printed words out loud. A fundamental and uni- 
versal aspect of reading is that every orthography encodes the 
spoken language using abstract symbols. Reading a word accu- 
rately in any language requires linkages between orthographic, 
phonological, and meaning representations (Seidenberg & McClel- 
land, 1989). However, the way that orthography maps onto pho- 
nology and morphology differs across languages and influences 
the reading process (Perfetti, 2003; Ziegler & Goswami, 2005). 

Both English and Spanish are alphabetic languages based on the 
Roman script. However, the two scripts differ with respect to the 
transparency of grapheme-to-phoneme correspondences. English 
is considered a deep orthography because it has many-to-many 
sound-to-symbol correspondences (e.g., ch in chord, chore, chute, 
ea in heal vs. healthy) (Geva & Siegel, 2000; Gholamain & Geva, 
1999; Venezky, 1970). Spanish, on the other hand, is a shallow 
orthography with near perfect grapheme-to-phoneme correspondences
(Jiménez-González, 1997). Despite the differences in
grapheme-phoneme correspondences, beginning readers of both
English and Spanish rely heavily on phonologically based skills, 
such as phonological awareness and decoding, to read words 
(Durgunoglu, Nagy, & Hancin-Bhatt, 1993; Lindsey, Manis, & 
Bailey, 2003). In fact, research has shown that phonological skills 
are related to reading ability and disability for children who learn 
to read Spanish or English, as well as other alphabetic languages 
(Ball & Blachman, 1991; Shaywitz & Shaywitz, 2005; Stanovich,
1988; Torgesen, Wagner, Rashotte, Rose, et al., 1999). 

The Chinese orthography is logographic in nature, with each 
Chinese character corresponding to both a morpheme and a syllable
in the spoken language. For example, the character 兵 means
soldier and is pronounced /bing1/. The majority of Chinese characters
are phonetic compound characters. Each compound contains
a phonetic component, which may or may not provide useful 
information about the pronunciation of the character. For example, 
清 /qing1/ is a regular character in which the phonetic 青 /qing1/
has the same pronunciation as the compound character; 倩 /qian4/
is an irregular character in which the pronunciation of the phonetic
青 is misleading. By the end of Grade 6, Chinese children are
required to learn about 2,500 characters, and 72% of them are
phonetic compounds (Shu, Chen, Anderson, Wu, & Xuan, 2003). 
The phonetic component only accurately represents character pro- 
nunciation in 39% of the compounds (Shu et al., 2003). 

There are two ways to read Chinese characters. A character 
without an internal structure, such as &, can only be read by 
mapping the whole character to its pronunciation. A compound 
character can be read either by mapping the whole character to its 
pronunciation or by naming the phonetic, although the latter is not 
always reliable. Thus, reading Chinese requires mapping ortho- 
graphic patterns to sound, which is different from the decoding 
strategy used for reading alphabetic languages. Due to the char- 
acteristics of the Chinese orthography, although phonological 
skills are related to reading in young Chinese children (e.g., 
McBride-Chang & Ho, 2005), other metalinguistic skills such as 
morphological awareness and orthographic processing account for 
more variance in Chinese reading than phonological skills (e.g., 
Keung & Ho, 2009; Leong, Tse, Loh, & Hau, 2008; Liao, Georgiou,
& Parrila, 2008; McBride-Chang et al., 2005; Tan & Perfetti,
1998; Tong & McBride-Chang, 2010; Yeung et al., 2011). There- 
fore, based on script characteristics, somewhat different processes 
are involved in reading the L1s of Spanish-English and Chinese-English
bilinguals.

A growing number of studies have provided evidence for cross- 
language transfer of word reading accuracy between two alphabetic
scripts (Gholamain & Geva, 1999; Gottardo, 2002; Manis,
Lindsey, & Bailey, 2004; Páez & Rinaldi, 2006). For example, in
a large-scale longitudinal study, Lindsey et al. (2003) found that 
for Spanish-speaking English language learners, Spanish phono- 
logical awareness and Spanish word reading accuracy measured in 
kindergarten were both predictive of English word identification in 
Grade 1. Based on their review of previous research, Dressler and 
Kamil (2006) concluded that word reading skills transfer across 
two alphabetic languages whether they are structurally close (e.g., 
Spanish-English) or distant (e.g., Arabic-English). They also
pointed out that the transfer occurs in L2 readers with a wide range 
of ages and proficiency levels. 

Studies of cross-language transfer of word reading skills be- 
tween Chinese and English, however, have produced mixed re- 
sults. Gottardo, Yan, Siegel, and Wade-Woolley (2001) found that 
in Chinese-English bilinguals (mean age of approximately 10 
years old), word reading skills in Chinese and English were not 
significantly correlated with one another, although phonological 
skills were correlated across the two languages. Similar findings 
were reported in Wang, Perfetti, and Liu (2005) for Chinese-English
bilinguals in Grades 2 and 3. The findings of these two
studies, together with those involving Spanish-English bilinguals, 
seem to suggest that transfer of word reading accuracy is con- 
strained by similarities between bilingual children’s L1 and L2. 
However, in another study, Keung and Ho (2009) observed a 
significant correlation between Chinese and English word reading 
skills among Grade 2 children in Hong Kong. 

For Chinese-English readers, the difference in the findings may 
be attributed to the educational context. The participants of both 
Gottardo et al. (2001) and Wang et al. (2005) were Chinese-English
bilinguals in North America, where English is the societal
language and the language of instruction. North American children 
are likely to use phonological and decoding strategies to read 
English because phonics instruction is a major component in early 
reading programs. Cantonese was the medium of instruction for 
the Hong Kong children in Keung and Ho (2009). These children
were exposed to English mainly in English language classes, 
where phonics instruction was not provided. Thus, it is plausible 
that the participants of Keung and Ho (2009) applied L1 strategies 
(i.e., whole word activation) to reading English, which strength- 
ened the cross-language connection between English and Chinese 
word reading. Given the inconsistency, more research is needed to 
investigate the nature of cross-language transfer of reading skills in 
Chinese-English bilinguals. 


Word Reading Fluency 


Reading fluency is “the oral translation of text with speed and 
accuracy” (Fuchs, Fuchs, Hosp, & Jenkins, 2001, p. 239). In 
beginning readers who are not yet able to read connected text, 
reading fluency is often assessed by the rate and accuracy of 
reading lists of isolated words or pseudowords aloud (Good, Sim- 
mons, & Kame’enui, 2001; Jenkins, Fuchs, van den Broek, Espin, 
& Deno, 2003). A typical word reading fluency measure instructs 
children to read aloud as many words or pseudowords as possible 
within a specified time (Torgesen, Wagner, & Rashotte, 1999). 
Research has shown that word reading fluency is an effective 
screening measure for determining at-risk status among beginning 
readers (Clemens, Shapiro, & Thoemmes, 2011; Compton et al., 
2010; Compton, Fuchs, Fuchs, & Bryant, 2006; Fuchs, Fuchs, & 
Compton, 2004). Furthermore, word reading fluency measured at 
an early age is a strong predictor of later text reading fluency 
(Geva & Farnia, 2012; Good et al., 2001; Jenkins et al., 2003) and 
reading comprehension (Aaron, Joshi, & Williams, 1999; Fuchs, 
Fuchs, & Maxwell, 1988; Joshi & Aaron, 2000; Marston, 1989; 
National Institute of Child Health and Human Development, 
2000). Good et al. (2001) showed that for children in Grade 1, 
word reading fluency measured with pseudowords in the first 
semester was highly correlated with text reading fluency as well as 
reading performance in the second semester. Geva and Farnia 
(2012) reported that word and text reading fluency loaded on a 
single factor for both English L1 and L2 students in Grade 2. Thus, 
there is a high degree of overlap between word and text reading 
fluency in the early grades, and word reading fluency may form the 
foundation for developing text reading fluency. In the present 
study, we chose to measure reading fluency with isolated words 
because our participants were bilingual children in Grade 1 at the 
beginning of the study and would have had difficulty reading 
connected text. 

Word reading fluency is characterized by rapid, automatic word 
recognition (Kuhn, Schwanenflugel, & Meisinger, 2010; Samuels, 
2006). According to Kuhn et al. (2010), automaticity possesses 
four features: speed, effortlessness, autonomy, and lack of con- 
scious awareness. A fluent reader reads by retrieving words di- 
rectly from long-term memory, as opposed to by phonological 
recoding, which is slower and more laborious. In addition, rapid 
word recognition occurs without intention (i.e., autonomy) or 
conscious awareness of the component skills (e.g., phonological 
awareness, morphological awareness, orthographic processing) re- 
quired for reading. Reading comprehension demands substantial 
cognitive resources. It is a challenging task for beginning readers 
because they must allocate a large amount of cognitive resources 
to word recognition. When word recognition becomes more fluent, 
readers can reallocate resources to text processing and achieve
better comprehension (Fuchs et al., 2004; Perfetti, 1985; Stanov- 
ich, 1980, 1994; Swanson & Berninger, 1995). 

Although learning to read English and Chinese requires somewhat
different component skills, in both orthographies, fluent
reading is developed by creating strong links among orthographic, 
phonological, and semantic patterns following repeated exposures 
to print (Clemens et al., 2011; Kuhn et al., 2010; Seidenberg, 2005; 
Seidenberg & McClelland, 1989; Shu & Anderson, 1999). There is 
evidence that similar processes are involved in reading Spanish 
sight words even though it is a more transparent orthography 
(Defior, Cary, & Martos, 2002; Wimmer & Goswami, 1994). 
Additional evidence for the use of similar processes in fluent 
reading across languages is the existence of the word superiority 
effect in alphabetic languages as well as in Chinese (Mattingly & 
Xu, 1993; Morton, 1969; Reicher, 1969). Across languages, fluent 
readers are better at tasks involving words than strings of letters.
Breznitz (2003, 2006; Breznitz & Berman, 2003) posited that rapid 
word reading results from successful and efficient synchronization 
and integration of phonological, orthographic, and semantic infor- 
mation. Because each of these components exists to some extent in 
written script, it is reasonable to assume that this process is similar 
across Spanish, Chinese, and English. 

To our knowledge, only one previous study examined transfer of 
reading fluency. In a 1-year longitudinal study, De Ramirez and 
Shapiro (2007) investigated the relationship between text reading 
fluency in Spanish (L1) and English (L2) among bilingual students 
in Grades 1-5. They found that Spanish fluency in the fall was 
correlated with English fluency in the spring. This result provides 
preliminary evidence for cross-language transfer of text reading 
fluency. However, the cross-language associations may have been 
due to spurious third variables because within-language measures 
known to be related to reading fluency, such as rapid naming and 
phonological awareness, were not controlled for in the analysis. In 
the present study, we extended previous research by including 
multiple within-language controls (e.g., nonverbal reasoning, pho- 
nological awareness, rapid naming, and an autoregressor of word 
reading fluency). In addition, we focused on word reading fluency 
as little is known about how this construct is related across bilin- 
gual children’s L1 and L2. 


The Present Study 


The present study examined cross-language transfer of word 
reading accuracy and fluency in Spanish-English and Chinese-English
bilinguals. With respect to word reading accuracy, we
predicted that cross-language transfer would vary as a function of 
similarities between the bilingual children’s L1 and L2. A stronger 
crossover effect would be observed between English and Spanish 
than between English and Chinese because English and Spanish 
are both alphabetic scripts that require grapheme-to-phoneme cor- 
respondences in reading. Whereas Spanish-speaking students bring 
to English reading alphabetic decoding abilities developed in their 
L1, the same cannot be said of Chinese-speaking students, as they
use different strategies in L1 reading. With respect to word reading 
fluency, we predicted that cross-language transfer would occur for 
both Spanish-English and Chinese-English bilinguals. This pre- 
diction stems from the script-universal perspective of cross- 
language transfer. Considering that reading in any language re- 
quires efficient synchronization of phonological, orthographic, and 
semantic information, reading fluency may be a script-universal
construct that is related across bilingual children’s L1 and L2 
regardless of the degree of overlap between them. 

The current study adopted a cross-lagged design. Participants 
were assessed twice, in Grades 1 and 2, respectively. Cross-lagged 
designs are more rigorous than cross-sectional correlational de- 
signs as they can account for the effect of an autoregressor, which 
is the outcome variable measured at an earlier time point. Without 
accounting for the autoregressor, relationships among the other 
predictors can be artificially inflated (Kenny, 1975). Additionally, 
we controlled for nonverbal reasoning, within-language phonolog- 
ical awareness, and rapid automatized naming (RAN) in our cross- 
language analysis. For instance, when English word reading accu- 
racy in Grade 2 was the dependent variable, nonverbal reasoning, 
English phonological awareness, English RAN, and English word 
reading accuracy in Grade 1 were entered in the model before 
Grade 2 L1 word reading accuracy, the target predictor. Control- 
ling for the autoregressor and multiple within-language variables 
greatly reduced the possibility that any cross-language transfer 
observed could be due to within-language relationships or other 
spurious third variables. 


Method 


Participants


Data collection occurred at two time points spaced 1 year apart, 
spring of Grade 1 and spring of Grade 2. In Grade 1, participants
included 61 Spanish-English bilingual children and 68 Chinese-English
bilingual children. One year later, 51 Spanish-English children
(mean age = 80.90 months, 30 girls) and 64 Chinese-English children
(mean age = 81.43 months, 36 girls) remained in the
study. Attrition occurred because some children moved away. Data 
were analyzed only for the students who participated in the study 
at both time points. The Chinese-English bilingual sample con- 
sisted of both Cantonese and Mandarin speakers. Of the 64 
Chinese-English bilinguals who were used for data analysis, there 
were 49 Cantonese speakers and 15 Mandarin speakers. Because 
the Cantonese and Mandarin speakers performed similarly on most 
measures, the two subgroups were collapsed in data analysis to 
increase power. The only group difference between the Mandarin 
and Cantonese speakers was on phonological awareness in Chi- 
nese in Grade 1, with the Mandarin speakers performing signifi- 
cantly better on this measure. 

The children were recruited from 19 schools in predominantly 
middle-class neighborhoods in two Canadian cities. English was 
the language of instruction in all schools. Approximately 70% of 
the Spanish-English bilinguals and 90% of the Chinese-English
bilinguals attended heritage language classes for 2.5 hr per week. Within
Spanish-English homes, 30% of parents reported speaking only 
in Spanish. The other 70% reported conversations occurring in 
both Spanish and English. Within the Chinese-English homes, 
35% of the sample reported speaking only in Chinese, with the 
other 65% reporting conversations occurring in both Chinese and 
English. The mean level of parental education was high school for 
both groups. Approximately 60% of the parents had a high school 
education or less, and 25% had a university education, whereas the 
other 15% did not report their level of education. 


Measures 


A battery of tests including nonverbal reasoning, phonological 
awareness, RAN of digits, word reading accuracy, and word read- 
ing fluency was given to each participant in Grade 1. All measures 
except for nonverbal reasoning were given in both English and the 
participant’s L1. One year later, the same word reading accuracy 
and fluency measures were given in English and the participant’s 
L1 in Grade 2. 

Nonverbal reasoning. The Matrix Analogies Reasoning Test 
(Naglieri, 1985) was administered as a measure of nonverbal 
reasoning. This test involved presenting 64 pictures organized into 
four subtests of 16 items each. All participants attempted each of 
the four subtests. For each item, the child was shown a pattern with 
one portion missing and was asked to choose among six options 
the one that correctly completed the pattern. A stopping rule of 
four consecutive incorrect items was used for each subtest. The 
Cronbach’s alpha for children aged 6 to 7 years old is .94 based on 
the testing manual (Naglieri, 1985). 

Phonological awareness. English phonological awareness 
was assessed by an experimental deletion test (Jared, Cormier, 
Levy, & Wade-Woolley, 2011). The test consisted of three subtests 
assessing syllable, onset-rime, and phoneme deletion, respectively. 
The subtests were administered in that order, with five practice 
items and 12 test items in each subtest. The maximum score of the 
test was 36. In the syllable deletion subtest, for example, the child 
was asked to delete the first or second syllable from a multisyllable 
word: “Say bam-daw, now say what is left of bam-daw if you don’t 
say daw.” If the child scored two or fewer items correctly in a 
subtest, subsequent levels were not administered. The Cronbach’s 
alpha calculated for our study was .86. 
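
For readers less familiar with the reliability index reported for these measures, the following minimal sketch shows how Cronbach's alpha can be computed from item-level scores. The function name and the small response matrix are hypothetical illustrations; they are not the study's data or scoring code.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of per-child rows, one score per item."""
    k = len(item_scores[0])                  # number of items
    items = list(zip(*item_scores))          # transpose: one tuple per item
    sum_item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical 0/1 responses from five children on four items (not study data).
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(demo)
print(round(alpha, 2))
```

Alpha rises when the items covary strongly relative to their individual variances, which is why it is read as an index of internal consistency.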

Spanish phonological awareness was measured by the Elision 
subtest of the Test of Phonological Processing in Spanish (TOPPS; 
Francis et al., 2001). The test contained three practice items and 20 
test items (three syllable deletion items and 17 phoneme deletion 
items). The test was stopped when three consecutive incorrect 
responses were made. A parallel measure was used for Chinese-English
bilinguals in Cantonese (Gottardo et al., 2001) and adapted
for Mandarin. The Chinese measure contained the same numbers 
of practice and test items and followed the same procedure as the 
Spanish measure. The Cronbach’s alpha calculated for our study 
was .88 for the Spanish measure and .95 for the Chinese measure. 

Rapid Automatized Naming (RAN)-Digits. Rapid naming 
was assessed by the RAN-Digits subtest from the Comprehensive
Test of Phonological Processing (Wagner, Torgesen, & Rashotte,
1999) in English and the RAN-Digits subtest of the TOPPS
(Francis et al., 2001) in Spanish. A parallel experimental measure was
adapted to Chinese. The same testing stimuli were used, but 
children were requested to respond either in Mandarin or in Can- 
tonese. In all three measures, the child was presented with six rows 
of the same six digits arranged in different orders and was asked to 
name the 36 digits as quickly and accurately as possible. Two 
practice examples were given prior to testing to ensure that the 
child understood the instructions and was able to name the digits. 
Two alternative forms (A and B) were completed, and the time it 
took to name all the digits in seconds was recorded and used as the 
raw score in the analyses. The test-retest reliability for children 
aged 5 to 7 was .91 for the English measure according to the test 
manual (Wagner et al., 1999). 

Word reading accuracy. Word reading accuracy was assessed
by the Word Identification subtest of the Woodcock Reading
Mastery Test-Revised (Woodcock, 1991) in English and by
the Identificación de letras y palabras subtest of the Woodcock
Language Proficiency Battery-Revised, Spanish Form (Woodcock &
Muñoz-Sandoval, 1995) in Spanish. Both tests required the child
to read words that increased in length and difficulty and had a 
stopping rule of six consecutive words read incorrectly. According 
to the manuals, the Cronbach’s alpha was .92 for the English 
measure and .95 for the Spanish measure (Woodcock, 1991). 

Word reading accuracy in Chinese was measured by a test 
previously used in Gottardo et al. (2001). The same set of 100 
characters was used for both Cantonese and Mandarin speakers. 
The Cantonese items consisted of traditional characters, whereas 
the Mandarin items were in the simplified form. The items were 
presented in order of increasing difficulty. Children were encour- 
aged to attempt all characters and were allowed to skip items or 
guess when they were unsure. The Cronbach’s alpha calculated for 
this sample was .95. 

Word reading fluency. Word reading fluency was assessed 
by the real-word subtest of the Test of Word Reading Efficiency 
(TOWRE) in English and the TOWRE words subtest from the 
TOPPS in Spanish (Francis et al., 2001). For both tests, raw scores 
were the number of items read correctly from a list of words in 45 
s. According to the manuals, the test-retest reliability for ages 6 
and 7 years was .97 (Torgesen, Wagner, & Rashotte, 1999) for the 
English version and .94 for the Spanish version. A parallel exper- 
imental measure following the same procedure was administered 
in Chinese. The test consisted of 104 items that were presented in 
order of increasing difficulty. Traditional characters were used for 
the Cantonese version, whereas simplified characters were used for 
the Mandarin version. The Cronbach’s alpha calculated for this 
sample was .90. 


Comparability of the Word Reading Measures 


Comparability of measures used across languages was a central 
and critical aspect of the present study. It was necessary to estab- 
lish that all parallel tasks were valid and reliable and that they 
measured equivalent constructs across languages. The English and 
Spanish tasks were taken from standardized assessment packages 
with established validity and high rates of reliability (Francis et al., 
2001; Torgesen, Wagner, & Rashotte, 1999; Woodcock, 1991; 
Woodcock & Muñoz-Sandoval, 1995). Since no standardized measures
existed for Chinese, experimental measures were used in the
present study. The Chinese measures were specifically designed to 
be analogous in the construction, instructional protocol, and diffi- 
culty level of their English and Spanish counterparts. Research 
based on the Chinese measures has been reported in many studies, 
and similar patterns of results occurred across these studies, sug- 
gesting that the measures were valid and had a high degree of 
reliability (see Gottardo et al., 2001; Leong, Cheng, & Mulcahy, 
1987; Marinova-Todd, Zhao, & Bernhardt, 2010; Wang et al., 
2005). Thus, the formats of English, Spanish, and Chinese mea- 
sures were parallel across languages, and appropriately assessed 
the intended constructs. 

As an additional check to ensure that the word reading accuracy 
and word reading fluency measures were comparable across the 
three languages, we calculated the mean frequency for each measure
in each language. We used the SUBTLEX databases in
English (Brysbaert & New, 2009), Spanish (Cuetos, Glez-Nosti, 
Barbon, & Brysbaert, 2011), and Chinese (Cai & Brysbaert, 2010) 
in our calculation, with the log frequency of a word’s occurrence 
per million as the unit of analysis. The mean frequencies for the 
word reading accuracy measures were 3.42, 3.50, and 3.60 in 
English, Spanish, and Chinese, respectively. The mean frequencies 
for the word reading fluency measures were 3.05, 3.25, and 3.23 in 
English, Spanish, and Chinese, respectively. Univariate analysis of 
variance revealed no significant differences for either the word 
reading accuracy measures (p = .429) or the word reading fluency 
measures (p = .575), suggesting the levels of difficulty were 
similar across languages. 
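
The frequency comparison above can be sketched as a one-way analysis of variance across the three word lists. The sketch below computes only the F statistic and its degrees of freedom using the Python standard library; the word-level log frequencies are hypothetical placeholders, not the SUBTLEX values used in the study, and the function name is an assumption made for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: items around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical log frequencies for a few items per list (not study data).
english = [3.1, 3.6, 3.4, 3.5]
spanish = [3.3, 3.7, 3.5, 3.5]
chinese = [3.4, 3.8, 3.6, 3.6]

f_stat, df_b, df_w = one_way_anova_f([english, spanish, chinese])
print(round(f_stat, 2), df_b, df_w)
```

Converting F to a p value requires the F distribution, which the standard library does not provide; a routine such as `scipy.stats.f_oneway` returns both values directly.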


Procedure 


All children were tested individually by trained research assis- 
tants in a quiet room during the school day. The children were 
tested in English and in their L1 on different days, with the order 
of L1 and L2 testing counterbalanced. The Spanish and Chinese 
measures were administered by research assistants who were na- 
tive speakers of the respective languages. Both English and L1 
instructions were used for the L1 measures to ensure that children 
understood the tasks. Only English instructions were used for the 
English measures. On average, it took the students approximately 
1 hr to complete the measures in English and their L1. 


Results 


All measures were checked for normality, skewness, and kur- 
tosis. Some measures were positively skewed (e.g., English and L1 
RAN in Grade 1 for both groups, English and L1 word reading 
fluency in Grade 1 for both groups). For each of these measures, 
square root or logarithm transformations were performed to correct 
the distribution to normal. However, subsequent analyses con- 
ducted with transformed scores produced results virtually identical 
to those with untransformed scores. Therefore, the results with the 
untransformed raw scores are presented for all analyses. 
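
As an illustration of the screening step described above, the following sketch computes a sample skewness coefficient before and after square root and logarithm transformations. The adjusted Fisher-Pearson formula used here is an assumption (the article does not state which estimator was used), and the naming times are hypothetical values chosen to have a long right tail.

```python
import math
from statistics import mean


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient (assumed estimator)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


# Hypothetical naming times in seconds (not study data).
ran_times = [38, 41, 44, 45, 47, 50, 52, 55, 61, 70, 96, 140]

raw_skew = sample_skewness(ran_times)
sqrt_skew = sample_skewness([math.sqrt(x) for x in ran_times])
log_skew = sample_skewness([math.log(x) for x in ran_times])
print(round(raw_skew, 2), round(sqrt_skew, 2), round(log_skew, 2))
```

Both transformations compress the right tail, with the logarithm compressing it more strongly than the square root, so the skewness estimate shrinks in that order.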

Table 1 presents the mean raw scores, standard deviations, 
maximum scores, skewness, and kurtosis statistics for all measures 
for the Spanish-English and Chinese-English bilinguals. We conducted
t tests on the English measures across groups. The two
groups of bilinguals performed similarly on English phonological 
awareness and English RAN. The Chinese-English bilinguals out- 
performed the Spanish-English bilinguals on the English word 
reading accuracy and fluency measures in Grade 1; however, these 
differences became negligible by Grade 2. The results suggest that 
the Chinese-English and Spanish-English bilinguals had similar 
levels of English proficiency in Grade 2. A Pearson correlation 
matrix for all the variables is displayed by language group in Table 
2. To protect against Type I error, only results significant at p < 
.01 were interpreted as meaningful. Overall, most variables were
significantly correlated. Notably, scores on word reading fluency 
were significantly correlated across languages in both grades for 
both groups of participants with moderate to high correlations. For 
the Spanish-English bilinguals, scores on word reading accuracy 
were significantly correlated across languages in both grades. For 
the Chinese-English bilinguals, only Chinese word reading accu- 
racy in Grade 1 was significantly correlated with English word 
reading accuracy in Grade 1. 

Table 1
Descriptive Statistics of Chinese-English and Spanish-English Bilinguals

                  Chinese-English bilinguals                  Spanish-English bilinguals
Variable         Max       M      SD    Skew    Kurt         Max       M      SD    Skew    Kurt
MAT               58   30.48   12.14    0.28     [?]          43   24.51    8.55   -0.18    0.02
L1 RAN G1        228   70.61   43.57    2.43    4.33         613  111.10  109.80    3.10   10.36
L1 PA G1          20    9.58    6.59    0.13     [?]          18    6.53    3.63    1.58     [?]
L1 WRA G1         43    8.48    6.65    2.58   10.84          49   14.63    9.05    1.62    3.94
L1 WRA G2         51   10.05    8.77    2.80    9.09          48     [?]   10.97    0.16   -0.46
L1 WRF G1         20    6.83    5.24    1.04    0.14          43    9.39    8.29    2.04    5.76
L1 WRF G2         28    8.66    6.09    1.36     [?]         [?]   18.47   10.38    1.17    2.26
E RAN G1         136    51.2   19.26    2.24    6.43         168   55.02   20.05    3.66   20.03
E PA G1           36   21.06    8.41   -0.72    0.11          33   23.08    6.37     [?]    2.02
E WRA G1          75   43.08    16.9   -0.50   -0.24          93   32.78   18.48    0.72     [?]
E WRA G2          78   55.06   13.97   -0.84    0.33          81     [?]   12.44     [?]    5.45
E WRF G1          71   41.25   18.88    0.29   -1.08          80   26.73     [?]    1.04     [?]
E WRF G2          77   54.97   14.73   -1.04    0.70          88   49.37   15.04   -0.40    1.32

Note. Max = maximum; Skew = skewness (SE = .30 for the Chinese-English bilinguals, SE = .33 for the Spanish-English bilinguals); Kurt = kurtosis (SE = .59 for the Chinese-English bilinguals, SE = .66 for the Spanish-English bilinguals); MAT = Matrix Analogies Reasoning; RAN = Rapid Automatized Naming; PA = phonological awareness; WRA = word reading accuracy;
WRF = word reading fluency; L1 = first language; E = English; G1 = Grade 1; G2 = Grade 2. [?] = value illegible in the source.
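
The standard errors of skewness and kurtosis given in the Table 1 header depend only on sample size. As a rough check, the sketch below applies the conventional large-sample formulas to the two analysis samples (n = 64 and n = 51, taken from the Participants section). The choice of formulas is an assumption, since the article does not state how the standard errors were obtained.

```python
import math


def se_skewness(n):
    """Conventional standard error of sample skewness for sample size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Conventional standard error of sample excess kurtosis for sample size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


for label, n in (("Chinese-English", 64), ("Spanish-English", 51)):
    print(label, round(se_skewness(n), 2), round(se_kurtosis(n), 2))
# prints:
# Chinese-English 0.3 0.59
# Spanish-English 0.33 0.66
```

Under these formulas the values agree with the tabled standard errors, which suggests the table was computed on the 64 and 51 children retained at both time points.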


Regression and Commonality Analyses 


Analytic strategy. Based on the key research questions and 
the results of the correlational analyses, hierarchical regression 
analyses were conducted to identify significant cross-language 
relationships for word reading accuracy and fluency in the
Chinese-English and Spanish-English bilinguals, beyond within- 
language control variables. We first carried out four regression 
analyses on the combined sample. Grade 2 word reading accuracy 
or fluency in L1 or English acted as the dependent variable, 
respectively. For each regression analysis, language group (Span- 
ish or Chinese) was entered in Step 1, followed by the main effect 
and interaction of the different independent variables and language 
group. Specifically, nonverbal reasoning and the interaction of this 
variable with language group were entered in Step 2. Within-language
phonological awareness and the interaction of this variable
with language group were entered in Step 3. Within-language
RAN and the interaction of RAN with language group were
entered in Step 4. Step 5 included the autoregressor effect of Grade
1 word reading accuracy or fluency (congruent with the dependent
variable) on Grade 2 word reading and the interaction of the
autoregressor by language group. In Step 6, cross-language word
reading accuracy or fluency in Grade 2 (congruent with the dependent
variable) was added to test cross-language transfer of the
construct, together with the interaction of language group and the
cross-language variable.

Notably, our cross-language relationships were concurrent as
we used Grade 2 cross-language variables to predict Grade 2
reading outcomes. This decision was made because the children’s
reading skills were more developed in Grade 2 and,
consequently, the reading measures administered in Grade 2
captured more variance than the reading measures administered
in Grade 1. To save space, the regression tables for the joint
analysis are not included in the article but are available in
Tables S1-S4 of the online supplemental materials.


Table 2
Pearson Correlations Among All Variables for the Chinese-English Bilinguals (Above the Diagonal) and the Spanish-English Bilinguals (Below the Diagonal)

Variables: 1. MAT; 2. L1 RAN G1; 3. L1 PA G1; 4. L1 WRA G1; 5. L1 WRA G2; 6. L1 WRF G1; 7. L1 WRF G2; 8. E RAN G1; 9. E PA G1; 10. E WRA G1; 11. E WRA G2; 12. E WRF G1; 13. E WRF G2.
[Correlation coefficients are illegible in the source.]

Note. MAT = Matrix Analogies Reasoning; RAN = Rapid Automatized Naming; PA = phonological awareness; WRA = word reading accuracy;
WRF = word reading fluency; L1 = first language; E = English; G1 = Grade 1; G2 = Grade 2.

Considering most interaction terms were significant in the re- 
gression models described above, separate regressions were then 
conducted for the two groups. In each separate regression analysis, 
nonverbal reasoning, within-language phonological awareness, 
and within-language RAN were entered in the first three steps. 
Grade 1 word reading accuracy or fluency (congruent with the 
dependent variable) was added in Step 4 to control for the autore- 
gressor effect of Grade 1 reading on Grade 2 reading. Finally, in 
Step 5, cross-language word reading accuracy or fluency in Grade 
2 (congruent with the dependent variable) was added to test 
cross-language transfer of the construct. 
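
The stepwise logic of these regressions can be sketched as follows: predictors are entered one block at a time, and the increment in R² at each step indicates what the new predictor explains beyond those already entered. This is a minimal illustration using only the standard library, assuming simulated data and hypothetical variable names; it is not the software or data used in the study.

```python
import random


def r_squared(columns, y):
    """R^2 from an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Simulated scores for 200 hypothetical children (not study data).
random.seed(0)
n = 200
nonverbal = [random.gauss(0, 1) for _ in range(n)]
pa = [0.5 * v + random.gauss(0, 1) for v in nonverbal]
ran = [random.gauss(0, 1) for _ in range(n)]
g1_fluency = [0.4 * a - 0.3 * b + random.gauss(0, 1) for a, b in zip(pa, ran)]
l1_fluency_g2 = [0.5 * f + random.gauss(0, 1) for f in g1_fluency]
eng_fluency_g2 = [0.6 * f + 0.3 * l + random.gauss(0, 1)
                  for f, l in zip(g1_fluency, l1_fluency_g2)]

# Order of entry mirrors the five steps described in the text.
steps = [
    ("Step 1: nonverbal reasoning", nonverbal),
    ("Step 2: phonological awareness", pa),
    ("Step 3: RAN", ran),
    ("Step 4: Grade 1 autoregressor", g1_fluency),
    ("Step 5: Grade 2 cross-language predictor", l1_fluency_g2),
]

entered, previous, delta_r2 = [], 0.0, {}
for name, column in steps:
    entered.append(column)
    current = r_squared(entered, eng_fluency_g2)
    delta_r2[name] = current - previous   # variance added at this step
    previous = current

for name, delta in delta_r2.items():
    print(f"{name}: delta R^2 = {delta:.3f}")
```

Because the cross-language predictor enters last, its increment reflects only the variance it explains after the autoregressor and the within-language controls have been accounted for.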

To gain a better understanding of the cross-language relation- 
ships, we conducted commonality analyses separately for the 
Chinese-English and Spanish-English bilinguals to further clarify 
the similarities and differences between groups (Pedhazur, 1997). 
The motivation for conducting a commonality analysis was to 
understand the common and unique contributions of the predictor 
variables to the outcome variables. It is beneficial to supplement a 
regression analysis with a commonality analysis because it ad- 
dresses collinearity in the data. Considering that the variables 
under study were often highly correlated with one another, a 
commonality analysis can establish the relative importance of 
these variables while accounting for the relationships among the 
independent variables (Mood, 1971; Newton & Spurrell, 1967; 
Nimon & Reio, 2011; Pedhazur, 1997; Zientek & Thompson, 
2010). Additionally, a commonality analysis produces both beta 
weights and structural coefficients. Structural coefficients are 
Pearson correlation coefficients between predictor variables and 
the predicted outcome. It is beneficial to examine both beta 
weights and structural coefficients, especially when multicollinear- 
ity may occur, as the influence of a predictor variable on an 
outcome variable may be shared with another variable’s beta 
weight (Courville & Thompson, 2001; Nimon & Reio, 2011; 
Thompson, 2006; Zientek & Thompson, 2010). To facilitate the 
reporting of the results, we present the variance unique to each 
predictor and the sum of all the variance components containing 
the cross-language predictor. The complete commonality analysis 
for each dependent variable and each group is available in Tables 
S5-S6 of the online supplemental materials. 
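
To make the partitioning concrete, the sketch below works through a commonality analysis for the simplest case of two predictors, where the unique and common components and the structure coefficients follow directly from the three pairwise correlations. The correlations are hypothetical, and the two-predictor restriction is a simplification: the models in this study contained five predictors, for which the number of components grows to 2^k - 1.

```python
import math


def commonality_two_predictors(r_y1, r_y2, r_12):
    """Partition R^2 for y ~ x1 + x2 given the three pairwise correlations."""
    # Squared multiple correlation for two predictors.
    r2_full = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    unique_1 = r2_full - r_y2 ** 2            # what x1 adds beyond x2
    unique_2 = r2_full - r_y1 ** 2            # what x2 adds beyond x1
    common = r_y1 ** 2 + r_y2 ** 2 - r2_full  # variance shared by x1 and x2
    multiple_r = math.sqrt(r2_full)
    return {
        "r2_full": r2_full,
        "unique_1": unique_1,
        "unique_2": unique_2,
        "common": common,
        # Structure coefficient: correlation of a predictor with predicted y.
        "structure_1": r_y1 / multiple_r,
        "structure_2": r_y2 / multiple_r,
    }


# Hypothetical correlations: x1 = autoregressor, x2 = cross-language predictor.
result = commonality_two_predictors(r_y1=0.6, r_y2=0.5, r_12=0.5)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

In this example the second predictor has a sizeable structure coefficient but a small unique component, which is the pattern that arises when a predictor overlaps heavily with the other variables in the model.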

The regression and commonality analyses are summarized in 
Tables 3 (regression), 4 (regression), 5 (commonality), 6 (regression),
7 (regression), and 8 (commonality). All regressions,
whether conducted jointly or separately for the two groups, satisfied
the assumptions of normality, homogeneity, and independence.
Since the beta weights produced by the regression analyses
dence. Since the beta weights produced by the regression analyses 
and the commonality analyses were identical, they are presented 
only in the regression tables to avoid redundancy. 

Cross-language transfer of word reading accuracy. The left 
panel of Table 3 displays the results of the regression analysis 
predicting Grade 2 English word reading accuracy for the Chinese 
speakers. In this regression analysis, Grade 1 English phonological 
awareness and Grade 1 English word reading accuracy both ex- 
plained unique variance. The right panel of Table 3 presents the 
regression analysis with Grade 2 English word reading accuracy as 
the dependent variable for the Spanish speakers. Grade 1 English
RAN and Grade 1 English word reading accuracy were unique
predictors of Grade 2 English word reading accuracy for Spanish-English
bilinguals. Cross-language relationships were not significant
for either group when predicting English word reading accuracy.
Specifically, when entered in the last step, Grade 2 Chinese
word reading accuracy did not explain unique variance in Grade 2 
English word reading accuracy. Similarly, Grade 2 Spanish word 
reading accuracy did not predict unique variance in Grade 2 
English word reading accuracy when entered last. 

The regression analysis predicting Grade 2 Chinese word read- 
ing accuracy for Chinese-English bilinguals is displayed in the left 
panel of Table 4. Only Grade 1 Chinese word reading accuracy 
was a unique predictor of Grade 2 Chinese word reading accuracy. 
Grade 2 English word reading accuracy did not contribute any 
unique variance to Grade 2 Chinese word reading accuracy. Thus, 
there was no cross-language transfer of word reading accuracy 
between Chinese and English. The right panel of Table 4 presents 
the results of the regression analysis examining Spanish word 
reading accuracy for the Spanish-English bilinguals. Among the 
three Grade 1 Spanish measures, only the autoregressor measure of 
Spanish word reading accuracy uniquely predicted the dependent 
variable. Importantly, Grade 2 English word reading accuracy was 
a unique predictor of Grade 2 Spanish word reading accuracy 
above and beyond all within-language controls. Thus, in the 
Spanish-English bilinguals, there was a significant cross-language 
effect of word reading accuracy from English to Spanish. Signif- 
icant transfer of word reading accuracy for the Spanish-English 
bilinguals, but not the Chinese-English bilinguals, was confirmed 
by a significant interaction of English word reading accuracy with 
language group when L1 word reading accuracy was the dependent
variable in the joint analysis.

The left panel of Table 5 presents the structural coefficients, 
proportions of unique and common variance and the corresponding 
percentages of explained variance for the model predicting English 
word reading accuracy. As explained earlier, we only present the 
sum of all the variance components, including the cross-language 
predictor. When English word reading accuracy was the dependent 
variable, structural coefficients of L1 word reading accuracy were 
weak (.21) for the Chinese-English bilinguals and moderate (.56) 
for the Spanish-English bilinguals. Virtually no unique variance 
was explained by L1 word reading accuracy for either group
(0.11% for the Chinese-English bilinguals and 3.26% for the 
Spanish-English bilinguals). Additionally, L1 word reading accu- 
racy explained 4.22% of the common variance in English word 
reading accuracy for the Chinese-English bilinguals and 27.72% 
of the common variance for the Spanish-English bilinguals. The 
right panel of Table 5 displays the results predicting L1 word 
reading accuracy. Structural coefficients of English word reading 
accuracy were weak (.27) for the Chinese-English bilinguals and 
moderate (.61) for the Spanish-English bilinguals. English word
reading accuracy explained 1.11% of the unique variance for the 
Chinese-English bilinguals and 12.74% of the unique variance for 
the Spanish-English bilinguals. Relatedly, English word reading 
accuracy accounted for a substantial portion of common variance, 
with the other predictors, for the Spanish-English bilinguals 
(24.42%), but not for the Chinese-English bilinguals (6.48%). 

Cross-language transfer of word reading fluency. The left
panel of Table 6 displays the results of the regression analysis
predicting Grade 2 English word reading fluency for the Chinese-English
bilinguals. English phonological awareness, English RAN,
and English word reading fluency in Grade 1 all contributed 
unique variance to Grade 2 English word reading fluency. Notably, 
Grade 2 Chinese word reading fluency, entered in the last step, was 
also a significant predictor of Grade 2 English word reading 
fluency. The right panel of Table 6 presents the regression model 
predicting Grade 2 English word reading fluency for the Spanish-English
bilinguals. Grade 1 English word reading fluency was the
only unique predictor among the three within-language variables. 
Importantly, Grade 2 Spanish word reading fluency contributed 
unique variance to Grade 2 English word reading fluency beyond
all within-language controls. 

The regression model predicting Grade 2 Chinese word reading 
fluency for the Chinese-English bilinguals is presented in the left 
panel of Table 7. Both Grade 1 Chinese RAN and Grade 1 Chinese 
word reading fluency were unique predictors. Grade 2 English 
word reading fluency was a marginally significant predictor (p = 
.068). The regression analysis with Grade 2 Spanish word reading 
fluency as the dependent variable is displayed in the right panel of 
Table 7. Among the three Grade 1 Spanish variables, only Grade 
1 word reading fluency accounted for unique variance in Grade 2 
Spanish word reading fluency. It is noteworthy that Grade 2 
English word reading fluency was also a unique predictor of Grade 
2 Spanish word reading fluency. The results from the joint analysis 
confirmed the cross-language results. No significant interaction 
was found when English reading fluency was the dependent vari- 
able, suggesting that patterns of transfer were not statistically 
different across groups. When predicting L1 word reading fluency, 
there was a significant interaction of the cross-language predictor 
with language group. Overall, the results of reading fluency across 
languages suggested that for the Chinese-English bilinguals, there
was a significant cross-language effect from Chinese to English 
and a marginally significant effect from English to Chinese. For 
the Spanish group, there was bidirectional transfer of word reading 
fluency between Spanish and English. 

The left panel of Table 8 presents the structural coefficients, the 
proportions of unique and common variance, and the corresponding
percentages of explained variance for the model predicting
English word reading fluency. The structural coefficients between
L1 word reading fluency and English word reading fluency were 
moderate (.49) to strong (.86) for the Chinese-English bilinguals 
and Spanish-English bilinguals, respectively. L1 word reading 
fluency explained 1.91% of the unique variance for the Chinese-English
bilinguals and 18.16% for the Spanish-English bilinguals.
Notably, despite the difference in the amount of variance ex- 
plained, regression analyses showed that L1 reading fluency was a 
significant unique predictor of English word reading fluency for 
both groups. In addition, L1 word reading fluency explained about 
22.53% of the common variance in the Chinese-English bilinguals 
and 56.34% of the common variance in the Spanish-English 
bilinguals. Finally, as shown in the right panel of Table 8, when L1 
word reading fluency was the dependent variable, structural coefficients
were moderate (.56) to strong (.85) for the Chinese-English
bilinguals and Spanish-English bilinguals, respectively.
English word reading fluency uniquely accounted for 3.21% and 
15.75% of explained variance for the Chinese-English bilinguals 
and Spanish-English bilinguals, respectively. Additionally, Eng- 
lish word reading fluency accounted for 27.52% and 55.66% of the 
variance jointly with other variables for the two groups, respec- 
tively. 


Table 8
Commonality Analysis for Predicting Grade 2 English and L1 Word Reading Fluency
ᵃ Common variance of E WRF G2 columns represents the sum of all the variance components containing L1 WRF G2. Common variance of L1 WRF G2
columns represents the sum of all the variance components containing E WRF G2. Common variance of other components is not included here to save
space. ᵇ Total represents the total amount of unique and shared variance of all the variables in the model.


Discussion


The present study examined cross-language transfer of word
reading accuracy and fluency. The participants of the study were
Spanish-English and Chinese-English bilinguals, whose first languages
differ with respect to how the orthography represents the
oral language. By comparing the transfer patterns between the two 
groups of children, we sought to understand script-specific and 
script-universal processes in cross-language relations for word 
reading accuracy and fluency. With respect to word reading accu- 
racy, we predicted that there would be transfer between Spanish 
and English due to the shared script and the overlap in grapheme- 
to-phoneme correspondences and that there would not be transfer 
between Chinese and English due to script differences. Regarding 
word reading fluency, we predicted that transfer would occur for 
both Spanish-English and Chinese-English bilinguals as the abil- 
ity to synchronize and integrate information from phonological, 
morphological, and orthographic systems should be related across 
the L1 and L2, regardless of the degree of overlap between the two 
languages. 

As expected, we found cross-language relations for word read- 
ing accuracy in Spanish-English bilinguals. Grade 2 English word 
reading accuracy predicted unique variance in Grade 2 Spanish 
word reading accuracy after controlling for nonverbal reasoning, 
Grade 1 Spanish phonological awareness, Grade 1 RAN, and 
Grade 1 Spanish word reading accuracy. Furthermore, common- 
ality analysis showed that Grade 2 English word reading accuracy 
also explained a large amount of variance in Grade 2 Spanish word 
reading accuracy jointly with other predictors. Our results add to a 
growing body of research demonstrating cross-language transfer of 
phonological and word reading skills in Spanish-English bilin- 
guals (August & Shanahan, 2006; Dressler & Kamil, 2006; 
Gholamain & Geva, 1999; Gottardo, 2002; Lindsey et al., 2003; Manis et 
al., 2004; Paez & Rinaldi, 2006). Since Spanish and English are 
both represented by alphabetic scripts, they require the same 
component skills in reading acquisition. Phonological awareness, 
letter knowledge, and decoding strategies are important for learn- 
ing to read both languages. The association observed between 
English and Spanish word reading accuracy suggests that for 
bilinguals, structural similarities in L1 and L2 facilitate cross- 
language relationships for word reading accuracy. 


Interestingly, the crossover effect of word reading accuracy in 
Spanish-English bilinguals was only found from English to Span- 
ish, not in the other direction. Grade 2 Spanish word reading 
accuracy was not a unique predictor of Grade 2 English word 
reading accuracy, although there was a strong structural coefficient 
and a large amount of shared variance with the other predictors in 
the model. The unidirectional association was likely related to the 
fact that the children received their formal reading instruction in 
English and, as a result, their English reading skills were more 
advanced than their Spanish reading skills. Extant research has 
demonstrated similar patterns of transfer where language and lit- 
eracy skills in the dominant language facilitate acquisition of 
literacy skills in the less proficient language (Deacon, 
Wade-Woolley, & Kirby, 2007; Pasquarella, Chen, Lam, Luo, & Ramirez, 
2011; Zhang et al., 2010). Alternatively, as suggested by a reviewer, 
the direction of the cross-language association may derive from 
contrasting patterns of regularity in the two languages. Grapheme- 
to-phoneme correspondences are far less regular and therefore 
more difficult to acquire in English than in Spanish. Therefore, 
English may provide a better foundation for transfer than Spanish 
as knowledge of Spanish grapheme-to-phoneme correspondences 
may be too regular to add any value to English reading accuracy. 
Additional research is needed to determine whether these expla- 
nations are complementary or whether one will prevail. Our find- 
ings, combined with those of previous research, suggest that pos- 
sible factors that determine the direction of cross-language 
relations of literacy skills include bilingual students’ relative pro- 
ficiency in their L1 and L2 and the typological characteristics of 
the scripts involved. 

Our study did not find significant relationships between L1 and 
L2 word reading accuracy skills for the Chinese-English bilin- 
guals. The cross-language structural coefficients were weak, and 
the amount of variance shared with other predictors was also small. 
These findings were consistent with the research conducted by 
Gottardo et al. (2001) and Wang et al. (2005). Lack of associations 
between Chinese and English reinforces the notion that transfer of 
word reading accuracy is a script-specific process that operates on 
similarities between L1 and L2. In addition, there is evidence that 
contextual factors, such as the language learning environment and 
the instructional method, play a role in transfer of reading skills 
across languages. Similar to Gottardo et al. and Wang et al., our 
study was conducted in the North American context. Research that 
reported a significant correlation between Chinese and English 
word reading was conducted by Keung and Ho (2009) in Hong 
Kong. North American children are likely to use phonological 
strategies to read English because phonics instruction is an impor- 
tant component in early reading programs. Children in Hong 
Kong, on the other hand, may be taught using the “look-and-say” 
method and therefore may be more inclined to use Chinese strat- 
egies, such as memorizing the whole word, to read English words 
(Keung & Ho, 2009; Shu & Anderson, 1999). It is possible that 
overlap in instructional methods and reading strategies between 
Chinese and English strengthens the cross-language connection of 
word reading skills in children in Hong Kong. The effect of 
educational context on cross-language transfer needs to be system- 
atically investigated by future studies that include comparable 
cross-cultural samples. 

With respect to word reading fluency, bidirectional relationships 
were found for the Spanish-English bilinguals. Grade 2 Spanish 
word reading fluency was a unique predictor of Grade 2 English 
word reading fluency after controlling for nonverbal reasoning and 
Grade 1 English phonological awareness, Grade 1 English RAN, 
and Grade 1 English word reading fluency. Similarly, Grade 2 
English word reading fluency explained unique variance in Grade 
2 Spanish word reading fluency after controlling for nonverbal 
reasoning and the Grade 1 Spanish measures. Additionally, com- 
monality analysis showed that the cross-language structural coef- 
ficients were strong, and the cross-language predictor explained a 
very large amount of variance in word reading fluency jointly with 
other predictors in both models. Thus, our results, obtained with 
stringent within-language controls, confirm and extend the work of 
De Ramirez and Shapiro (2007), who reported significant corre- 
lations between Spanish and English text reading fluency. There- 
fore, word reading fluency is related for Spanish and English, two 
alphabetic orthographies, despite the fact that Spanish has more 
transparent grapheme-to-phoneme correspondences than English. 

Importantly, cross-language transfer of word reading fluency 
was also observed in the Chinese-English bilinguals, whose two 
languages are represented by typologically different orthographies. 
For this group, Grade 2 Chinese word reading fluency predicted 
unique variance in Grade 2 English word reading fluency after 
accounting for nonverbal reasoning and Grade 1 English control 
variables. The prediction from Grade 2 English word reading 
fluency to Grade 2 Chinese word reading fluency was marginally 
significant and displayed a moderate cross-language structural 
coefficient. It is noteworthy that the cross-language predictor ex- 
plained a large amount of variance in word reading fluency to- 
gether with the other predictors in both models. By contrast, there 
was no unique prediction, weak cross-language structural coeffi- 
cients, and only a small amount of shared variance in the models 
predicting word reading accuracy for the Chinese-English bilin- 
guals. 

Since cross-language relationships occurred for word reading 
fluency in both Spanish-English and Chinese-English bilinguals, 
it appears that the mechanism underlying word reading fluency is 
largely script universal and not heavily influenced by differences 
in bilinguals’ L1 and L2. Word reading is based on perceptual, 
phonological, orthographic, and morphological processes, and 
reading fluency is determined by both the speed of processing 
within each system and synchronization and integration of the 
different systems (Breznitz, 2003, 2006; Breznitz & Berman, 
2003; Seidenberg, 2005). Although the different systems function 
in somewhat different ways in Spanish, English, and Chinese, the 
synchronization and integration process may be a universal aspect 
of reading across the three languages. Our findings suggest that 
having well-developed reading fluency in one orthography con- 
tributes to reading fluency in another regardless of L1 and L2 
differences. 

On the other hand, there were differences in the patterns of 
relationships for word reading fluency in the Spanish-English and 
Chinese-English bilinguals. In some instances, the crossover ef- 
fect was stronger in the Spanish-English bilinguals than the 
Chinese-English bilinguals, as indicated by the significant inter- 
action of English word reading fluency and the children’s L1 (i.e., 
Spanish or Chinese) when predicting L1 word reading fluency. 
Commonality analyses also demonstrated stronger cross-language 
structural coefficients and more unique and shared variance be- 
tween Spanish and English than between Chinese and English 
when predicting reading fluency. This result likely reflects the fact 
that accurate reading of words is necessary for word reading 
fluency. Our sample consisted of young children some of whom 
were fluent readers while others had not reached automaticity in 
word recognition. This variability in the sample resulted in an 
overlap between word reading accuracy and fluency skills, which 
was expected to be higher for the Spanish speakers than for the 
Chinese speakers. In other words, for Spanish-English bilinguals, 
cross-language relations for word reading fluency might be 
strengthened by cross-language relations for word reading accu- 
racy. This facilitation does not occur for Chinese-English bilin- 
guals, as word reading accuracy is not associated across the two 
languages. 

The results of the present study contribute to the theory of 
cross-language transfer. We outlined two different perspectives 
toward cross-language transfer at the beginning of the article. 
According to the script-dependent perspective, cross-language 
transfer of a reading skill is based upon similarities across lan- 
guages—the greater the overlap, the stronger the association of the 
skills across languages. This type of transfer is unlikely to occur if 
two languages do not share common features (Bialystok, Majum- 
der, & Martin, 2003; Koda, 2007; Osgood, 1949; Pasquarella et al., 
2011; Ziegler & Goswami, 2005). Alternatively, the script- 
universal perspective conceptualizes cross-language transfer as 
operating under cognitive and linguistic processes that are impor- 
tant for reading in all languages (Geva & Siegel, 2000). For 
bilingual children, a script-universal process transfers in the sense 
that it is recruited when reading both L1 and L2. This type of 
transfer is not conditioned by overlapping structural features. 

Our results provide strong evidence that both types of cross- 
language transfer occur in bilingual children’s reading develop- 
ment. While transfer of word reading accuracy hinges upon shared 
structures between L1 and L2, transfer of word reading fluency 
occurs regardless of structural similarities. In addition, our results 
point to the possibility that both script-specific and script-universal 
processes underlie the transfer of a single construct due to its 
complex and multifaceted nature. As previously mentioned, al- 
though word reading fluency is largely a script-universal process, 
it involves a script-specific component for less proficient readers 
because they apply decoding skills to read novel words encoun- 
tered in the fluency test. Our study represents the first step toward 
disentangling the two types of transfer in bilingual children’s 
reading development. Future research should examine the nature 
of transfer for other reading skills, such as morphological aware- 
ness, vocabulary, and reading comprehension. 

The findings of our study have practical implications for the 
assessment and classification of bilingual children. Due to the 
well-known finding that phonological awareness is highly related 
between bilingual children’s L1 and L2 regardless of the typolog- 
ical distance between the two languages, it is commonly agreed 
that there is no need to delay assessment for bilingual children 
until they develop sufficient oral proficiency in L2 (Cline & 
Frederickson, 1999; Geva & Herbert, 2012). Even for bilinguals 
from a nonalphabetic L1 background, assessment of phonological 
awareness can be effectively conducted in L1 for the purpose of 
predicting reading success in English. However, our findings in- 
dicate that unlike phonological awareness, word reading strategies 
are conditioned by script characteristics. Therefore, it cannot be 
assumed that older Chinese immigrant children will use alphabetic 
decoding strategies in reading English words. Instead, they may 
apply word reading strategies that are effective for reading Chi- 
nese, possibly leading to reading problems in English. Thus, ex- 
plicit instruction in decoding strategies specific to English is 
necessary for immigrant children who are already literate in their 
L1, especially for those from a nonalphabetic background such as 
Chinese. 

Another implication lies in using reading fluency as a potential 
cross-language screening measure for bilingual children. Although 
there is substantial evidence that word reading fluency is an 
effective screening measure for children who are native speakers 
of English (e.g., Compton et al., 2006, 2010; Fuchs et al., 2004), 
little is known about the value of using L1 reading fluency as a 
screening measure for bilingual children who are not yet literate in 
their L2. An important finding of the present study is that word 
reading fluency transfers between L1 and L2 in Spanish-English 
bilinguals as well as in Chinese-English bilinguals. In other words, 
L1 reading fluency can predict success in L2 reading fluency and 
is likely related to L2 reading comprehension for bilingual children 
from both alphabetic and nonalphabetic backgrounds (Kim, Wag- 
ner, & Lopez, 2012). Conversely, students with poor L1 fluency 
may be at risk of developing L2 reading problems regardless of 
their L1 backgrounds. Assessing reading fluency in the L1 may be 
especially useful for older immigrant children who are already 
literate in the L1 but are still building oral and reading vocabulary 
in the L2. Such assessment can avoid a further delay in identifi- 
cation of reading difficulties in this population. 

The present study has several limitations. One limitation 
concerns measurement of the same reading and cognitive vari- 
ables across Chinese, Spanish, and English. We used parallel 
standardized measures across languages when they were avail- 
able. Because there were no Chinese standardized measures, we 
created our own experimental measures. All of our Chinese 
measures were highly reliable. They also demonstrated corre- 
lational patterns similar to those observed in previous research 
(e.g., Gottardo et al., 2001; McBride-Chang & Ho, 2005), 
providing additional evidence of measurement validity. How- 
ever, since our research did not focus on measurement issues, 
establishing validity of the Chinese measures through empirical 
methods goes beyond the scope of the present study. Future 
studies should develop multiple measures of these constructs 
and administer them to a larger number of participants to 
examine validity. Another limitation lies in the comparability of 
our Chinese-English bilinguals and Spanish-English bilin- 
guals. Although the two groups of children were similar in 
many ways, including age, L1 exposure, and parental education, 
there were also some differences, especially in terms of enroll- 
ment in heritage language programs and attrition rates. Because 
these differences were also observed in previous studies com- 
paring Chinese-English bilinguals and Spanish-English bilin- 
guals (e.g., Bialystok et al., 2003; Chen, Ramirez, Luo, Geva, & 
Ku, 2012), we suspect that they represent systematic differences 
that exist between the two populations. Unfortunately, these 
were factors that we could not manipulate, and they attest to the chal- 
lenges in conducting research involving more than one group of 
bilingual children. Furthermore, our Chinese sample included 
both Mandarin- and Cantonese-speaking participants. Future 
research should replicate our findings with Chinese-English 
bilinguals who speak the same L1. Finally, future studies should 
be designed to specifically address factors that influence the 
direction and degree of cross-language transfer. For example, 
comparing the transfer patterns among balanced bilinguals, L1 
dominant children, and L2 dominant children may shed light on 
how cross-language relations are influenced by language dom- 
inance. 

To summarize, the present study provides novel insights into the 
underlying processes of transfer of word reading accuracy and 
word reading fluency in bilingual children. First, our results indi- 
cate that cross-language relations for word reading accuracy are 
conditioned by similarities between bilingual children’s L1 and 
L2. Our study went beyond the previous research by including 
stringent control variables in the regressions and using common- 
ality analyses to dissect unique and shared variance in cross- 
language relations. Second, our study reveals that transfer of word 
reading fluency is largely a script-universal process, similar to 
transfer of phonological awareness and working memory (Geva & 
Ryan, 1993; Geva & Siegel, 2000; Manis et al., 2004). Finally, our 
results suggest that cross-language relations are influenced by 
bilingual readers’ relative proficiency in L1 and L2, as well as by 
contextual factors such as the language learning environment and 
instructional methods. Understanding transfer of word reading 
accuracy and word reading fluency and the factors that influence 
the direction and strength of transfer has important implications 
for the assessment and instruction of bilingual children. 

There is increasing interest in the role of phonological awareness across languages. Research is 
uncovering cross-language effects of phonological awareness upon English reading, even from nonal- 
phabetic languages. However, little of this research has focused on examining the extent to which 
multiple measures of phonological awareness indicate a single construct within or across languages. This 
article updates 2 recent reviews of the literature by fitting rival a priori models of multiple measures in 
order to test within-language and across-language structure among multiple phonological awareness 
tasks. Although the number and types of languages covered were quite limited, the results demonstrate 
high cross-language correlations, suggesting that measurement error has attenuated prior estimates of the 
cross-language correlation of phonological tasks. The current results suggest that in alphabetic languages, 
there is empirical support for phonological awareness as a unitary ability within English and other 
languages. In Korean and in Spanish, phonological awareness may operate as a language-general 
construct. In Cantonese and Mandarin, the results were less clear. The results also highlight the 
limitations of the current research base and important areas for future investigation. 

There is growing interest in the cross-language role of pho- 
nological awareness, both as a construct by itself and because of 
its potentially important role in learning to read different lan- 
guages (Branum-Martin, Tao, Garnaat, Bunta, & Francis, 2012; 
Durgunoglu, Nagy, & Hancin-Bhatt, 1993; Melby-Lervag & 
Lervag, 2011). Phonological awareness, or the ability to recog- 
nize and manipulate linguistic sounds apart from their mean- 
ings, is a crucial skill in learning to read (National Institute of 
Child Health & Human Development, 2000; Rayner, Foorman, 
Perfetti, Pesetsky, & Seidenberg, 2001; Snow, Burns, & Griffin, 
1998; Wagner & Torgesen, 1987). Diverse measures of phono- 
logical awareness have high but somewhat inconsistent corre- 
lations across languages, as shown in two recent meta-analyses 
(Branum-Martin et al., 2012; Melby-Lervag & Lervag, 2011). 
However, method effects, such as using the same task in both 
languages (e.g., blending phonemes), could spuriously inflate 
cross-language correlations, leading researchers to believe that 
phonological awareness operates more similarly than it does in 
actuality. Relations across measures, both within and across 
languages, have implications for what we infer about phono- 
logical awareness: whether it is a single construct or multiple 
constructs, within language as well as across languages. In 
these prior analyses, such trait effects were not adequately 
distinguished from potential method effects. 

We care about whether phonological awareness is a single 
construct because of its implications for instruction and interven- 
tion (Branum-Martin et al., 2006). If phonological awareness is a 
single construct across languages (i.e., is language general), pho- 
nological instruction should improve performance in both lan- 
guages and potentially help reading, perhaps in either language. 
Alternatively, if phonological awareness is a language-specific 
ability (one construct in each language), instruction and interven- 
tions would have to be designed for the special cognitive and 
linguistic requirements of the separate abilities in those languages 
(as opposed to merely the specific sound structure and orthography 
of the particular language). 

The issue of multiple measures of one or more constructs 
represents a classic multitrait, multimethod problem (Campbell & 
Fiske, 1959; Eid, Lischetzke, & Nussbeck, 2006). In this article, 
we reanalyze prior studies of cross-language phonological aware- 
ness to specifically test theoretical models of trait (construct) 
versus method effects. 

Phonological Awareness Within and Across Languages 


Although there is growing evidence of phonological awareness 
as a single ability across multiple tasks within many languages, 
such as Spanish, Dutch, Greek, Korean, and Chinese (see review 
by Branum-Martin et al., 2012), what facilitating or interfering role 
differing linguistic features may have when learning other lan- 
guages is not clear. The role of phonology could be complex, 
involving multiple, language-specific constructs (abilities) in per- 
sons who speak more than one language (Grosjean & Li, 2013). 
Alternatively, phonology may involve but a single construct, 
which is domain general across languages (see the review by 
Thomas & van Heuven, 2005). 

On a theoretical basis, it is not clear how consistent the effects 
of phonology in different languages should be and how those 
might facilitate or complicate our investigation of phonological 
awareness as a construct. Similarity in phonological, grammatical, 
and lexical features could ease performance of skills across lan- 
guages (Flege, 1995), thereby suggesting that phonological abili- 
ties operate similarly, if not identically, across languages. Addi- 
tionally, computational linguistic models of bilingualism imply 
that the overlap of orthography and phonology can be important in 
reading (Grosjean & Li, 2013; Thomas & van Heuven, 2005). 
However, these models suggest that similar letters or sounds can 
activate as well as inhibit linguistic recognition and production 
from one language to another. Thus, it is possible that cross- 
language influences in phonology could be helpful, be difficult to 
overcome, or have no effect. 

The size of the linguistic units (e.g., syllables, onsets, rimes, 
phonemes) can also be important, especially when different lan- 
guages emphasize different grain sizes (Frost, 1998; Grosjean, 
2008; Grosjean & Li, 2013; Ziegler & Goswami, 2005, 2006). For 
example, alphabetic languages that emphasize phonemes, such as 
English or French, may foster different linguistic skills than do 
languages with more emphasis on larger grain sizes, such as Italian 
or Spanish (Ziegler & Goswami, 2005, 2006). Because the nature 
of phonological skills may differ across languages and in persons 
who speak more than one language, cross-language effects may 
not be simple. 

Linguistic analysis and cognitive experiments suggest that there 
is a basis for expecting that, across languages, phonological abil- 
ities are likely activated by a number of reading tasks, even when 
phonological information is limited or the writing system is not 
alphabetic (Perfetti, 2003; Perfetti, Liu, & Tan, 2005; Perfetti & 
Zhang, 1995; Perfetti, Zhang, & Berent, 1992). It has been argued 
that “phonological processes occur as part of reading in all writing 
systems, with the details of the writing system influencing the 
details of phonological processing” (Perfetti & Zhang, 1995, p. 
24). Writing systems represent spoken language, but they do not 
always do so reliably or consistently. Therefore, different writing 
systems will offer different degrees of facilitation and constraint in 
the representation of the sounds of speech (Flege, 1995; Frost, 
1998; Grosjean, 2008; Grosjean & Li, 2013; Ziegler & Goswami, 
2005, 2006). It is therefore plausible that phonological awareness 
is a single ability across languages, but the role of phonological 
awareness in learning to read may differ across writing systems. 

We therefore have unclear and potentially conflicting theoretical 
expectations. Diverse measures of phonological awareness may 
indicate language-specific abilities (i.e., two or more factors that 
may or may not be correlated), or these measures may indicate a 
single, domain-general ability across languages. Moreover, these 
structures of abilities could possibly differ in different languages 
(e.g., required skill structures may be different in English than they 
are in Chinese). Research must test these hypothetical possibilities 
while accounting for potential method effects that could be com- 
mon in measures across languages. 


Rationale for the Current Study 


Two prior syntheses have examined cross-language correlations 
among phonological measures (Branum-Martin et al., 2012; 
Melby-Lervag & Lervag, 2011). The first cross-language meta- 
analysis of phonological awareness (Melby-Lervag & Lervag, 
2011) examined the impact of instructional language (bilingual or 
only second language), home language (native or not reported), 
writing system (alphabetic or ideographic), and age upon 17 re- 
ported cross-language correlations between phonological aware- 
ness tasks among bilingual children. Among these studies, no 
significant effects were found for these four factors on the cross- 
language correlation between phonological awareness tasks. 

A second meta-analysis substantially expanded upon these re- 
sults to examine 101 correlations from 38 studies in nine languages 
(Branum-Martin et al., 2012). Cross-language correlations were 
found to differ strongly by the particular language but not neces- 
sarily by whether the writing system was alphabetic. There was a 
tendency for older samples of children to have lower cross- 
language correlations. Effects for linguistic features of the tasks 
were neither strong nor consistent (Branum-Martin et al., 2012). 
Together, these two studies suggest that cross-language relations 
among phonological awareness tasks are usually moderate to high: 
0.39 to 0.86 (Branum-Martin et al., 2012; Melby-Lervag & Le- 
rvag, 2011). 

The moderate to high correlations found in prior meta-analyses 
could simply reflect shared method variance between similar tasks: 
High cross-language correlation may be spurious. Although 
Branum-Martin et al. (2012) found no substantial difference in the 
cross-language correlation due to different types of task, only 
measures that were the same in each language were used. 

The threat of method covariance across languages may obscure 
our attempts to understand a hypothesized ability of phonological 
awareness, across or even within languages. With multiple mea- 
sures and different task types or methods, we can formulate rival 
structural models of trait versus method effects as confirmatory 
factor models (Eid et al., 2006). Indeed, the wide array of tasks can 
be distracting, unless researchers were to adopt a latent variable 
perspective in which each task may potentially be an indicator of 
an unobserved ability or factor. If the hypothesized ability, such as 
general phonological awareness, caused performance on the ob- 
served tasks, those tasks should relate to each other in a homoge- 
neous manner. Although tasks differ within each language, their 
similarities may suggest a general ability of phonological aware- 
ness in each of these languages. An illustration of the specific 
implications of such a confirmatory factor model is given in 
Appendix A. 
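
The expectation that tasks caused by a common ability relate to each other in a homogeneous manner can be sketched numerically. The following is a minimal illustration only, not a reproduction of Appendix A: it assumes standardized loadings, one factor per language, and no correlated residuals for shared methods, and the loadings and factor correlation are hypothetical values. Under these assumptions the implied correlation between two tasks is the product of their loadings, multiplied by the factor correlation when the tasks are in different languages.

```python
def implied_matrix(loadings_a, loadings_b, factor_corr):
    """Model-implied correlations for two correlated factors (one per language).

    loadings_a, loadings_b: standardized loadings of the tasks in each language.
    factor_corr: correlation between the two language factors; a value of 1.0
    corresponds to a single factor spanning both languages.
    """
    loadings = list(loadings_a) + list(loadings_b)
    n_a = len(loadings_a)
    n = len(loadings)
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(1.0)  # each standardized task correlates 1 with itself
            else:
                same_language = (i < n_a) == (j < n_a)
                phi = 1.0 if same_language else factor_corr
                row.append(loadings[i] * loadings[j] * phi)
        matrix.append(row)
    return matrix


# Hypothetical loadings for two tasks per language.
two_factor = implied_matrix([0.8, 0.7], [0.9, 0.6], factor_corr=0.5)
single_factor = implied_matrix([0.8, 0.7], [0.9, 0.6], factor_corr=1.0)

for row in two_factor:
    print([round(value, 2) for value in row])
```

Comparing the two matrices shows what distinguishes the rival models: the within-language correlations are identical, and only the cross-language block shrinks when the factor correlation falls below 1.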

Such tests of tasks in multiple languages can provide evidence 
of discriminant validity. Two studies have tested such cross- 
language models of phonological awareness with confirmatory 
methods. Branum-Martin et al. (2006) found that at the student 
level, after controlling for classroom differences, phonological 
awareness factors in English and Spanish were statistically sepa- 
rable but highly correlated (r = .93). In a study of Cantonese and 
Mandarin, Chen, Ku, Koyama, Anderson, and Li (2008) found that 
onset and rime tasks fit a single-factor model across the two 
languages, but tone awareness did not. With high cross-language 
correlations, these two studies illustrate the empirical possibility of 
a confirmatory model of phonological awareness as a single con- 
struct (i.e., factor) across languages. 

Overall, across the prior reported studies in the meta-analyses, 
issues of measurement error and structure among these tasks may 
have been overlooked, leaving a number of questions insufficiently 
addressed. For example, is high correlation within or across lan- 
guage only due to similarity of task or method? Is low correlation 
an idiosyncrasy of sample, design, or measure? Does low corre- 
lation imply something fundamentally wrong with our notions of 
how phonological awareness operates in students who speak more 
than one language? 
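
One way to see how measurement error could lower an observed cross-language correlation is the classical correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below is offered only as an illustration of that general idea; the latent variable models used in this article address measurement error differently, and the function name and the numeric values are hypothetical.

```python
import math


def disattenuate(observed_r, reliability_x, reliability_y):
    """Classical correction for attenuation of a correlation."""
    for reliability in (reliability_x, reliability_y):
        if not 0 < reliability <= 1:
            raise ValueError("reliabilities must lie in the interval (0, 1]")
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Hypothetical observed correlation and task reliabilities.
corrected = disattenuate(observed_r=0.56, reliability_x=0.80, reliability_y=0.70)
print(round(corrected, 3))
```

With less than perfectly reliable tasks, the corrected value is always at least as large in magnitude as the observed one, so a moderate observed correlation is compatible with a substantially higher correlation between the underlying abilities.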

The current study seeks to answer these questions via confir- 
matory models fit to the multiple phonological awareness tasks in 
all of the studies that reported a full correlation matrix for more 
than two measures of phonological awareness. We investigated the 
following three research questions, with the details of these models 
given in the following section: 


1. If phonological awareness is a single ability in each 
language, what does that model suggest about the cross- 
language correlation of phonological awareness con- 
structs versus cross-language method correlations? 


2. To what extent do multiple measures of phonological 
awareness suggest that it is a single ability across lan- 
guages, in the presence of method correlations? 


3. What do the current models imply with respect to the 
prior meta-analyses? 


Method 


Selection and Recording of the Studies 


This study analyzes data from the studies included in two recent 
meta-analyses (Branum-Martin et al., 2012; Melby-Lervag & Le- 
rvag, 2011). All of the 38 studies reported by Branum-Martin et al. 
(2012) were examined for whether the study reported more than 
two measures in a language in a full matrix (i.e., the cross- 
language correlation for each measure as well as the cross-measure 
and cross-language correlations). Nineteen of those prior studies 
reported a full correlation matrix on three or more measures. Three 
studies had previously been excluded because the measures were 
not the same across languages (Bursztyn, 1998; Wang, Cheng, &
Chen, 2006; Wang, Yang, & Cheng, 2009). The current analysis
therefore included 22 studies that reported on 25 samples. These
studies are listed in Appendix B with their sample characteristics
and measures.

For each study, the correlation matrix was recorded along with 
the means and standard deviations. For example, if a study re- 
ported five measures in each language, for a total of 10 measures, 
then the full 10 by 10 correlation matrix was recorded, and such a
study is referred to as a 5 × 5. If, instead, the study had only one
measure in English and three measures in the other language,
the 4 by 4 matrix was recorded. Such a design is called a 1 × 3
in the current study (see Appendix B). These matrices were
used in the latent variable analysis. 
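
The bookkeeping behind this design notation can be sketched in a few lines. The function name and the returned keys below are illustrative assumptions, not part of the original analysis; the sketch only counts how many within-language and cross-language correlations a recorded matrix contains.

```python
def design_summary(n_english, n_other):
    """Summarize the recorded matrix for an (n_english x n_other) design.

    Hypothetical helper for illustration: counts the correlations in each
    block of the full correlation matrix.
    """
    total = n_english + n_other
    return {
        "matrix_size": (total, total),
        "within_english": n_english * (n_english - 1) // 2,
        "within_other": n_other * (n_other - 1) // 2,
        "cross_language": n_english * n_other,
        "unique_correlations": total * (total - 1) // 2,
    }


# The two designs described in the text.
print(design_summary(5, 5))
print(design_summary(1, 3))
```

In a 1 × 3 design there are no within-English correlations at all, which is one reason such studies give limited information about the English factor.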


Models 


We used structural equation models fit to correlations with 
means and standard deviations (Bollen, 1989; Kline, 2005; Mac- 
Callum, Wegener, Uchino, & Fabrigar, 1993), using the sample 
size reported in each study. With the given correlations, means, 
and standard deviations, the full covariance matrix can be used to 
test alternative, confirmatory structural models, as noted in Ap- 
pendix A (MacCallum et al., 1993). We used Mplus 7 (Muthén & 
Muthén, 2012) for the models fit to each study in this article. 
Issues of sample size and fit will be discussed. 
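
As a minimal sketch of how a covariance matrix follows from reported summary statistics, each covariance is the correlation multiplied by the two standard deviations. The function name and the numeric values below are hypothetical and are not taken from any study in the sample; the models in this article were fit in Mplus, not with this code.

```python
def covariance_from_correlation(corr, sds):
    """Rebuild a covariance matrix from correlations and standard deviations.

    cov[i][j] = corr[i][j] * sds[i] * sds[j]
    """
    k = len(sds)
    if len(corr) != k or any(len(row) != k for row in corr):
        raise ValueError("corr must be a k x k matrix matching len(sds)")
    return [[corr[i][j] * sds[i] * sds[j] for j in range(k)] for i in range(k)]


# Hypothetical summary statistics for three measures (illustration only).
corr = [
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
sds = [2.0, 3.0, 4.0]

cov = covariance_from_correlation(corr, sds)
for row in cov:
    print([round(value, 2) for value in row])
```

The diagonal of the result holds the variances, so no information beyond the reported correlations and standard deviations is needed.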

Research Question 1 posits that measures indicate one latent 
factor in each language, with method correlations for measures 
involving the same task in both languages (see Appendix A). A 
4 × 4 study therefore would have two correlated factors, each with
four indicators. Research Question 2 was tested by specifying all the
measures as indicators of a single factor. Studies that used a different
number of outcomes, such as 3 × 2, were fit as appropriately reduced
versions of these models. Method covariances were included only 
for observed indicators that were designed to be the same across 
languages. A factor model of this sort represents the extent to 
which the given indicators represent the intended construct (fac- 
tor), with potential residual method covariances. These structures 
will be presented graphically in the Results section. Research 
Question 3 was evaluated graphically by summarizing the model 
results along with estimates from the prior meta-analysis for each 
language. 
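
The structure just described can be illustrated with a small sketch of the correlations implied by a fully standardized two-factor model with residual method covariances. Everything here is an assumption for illustration: the function name, the convention that matched tasks are listed first and in the same order in both languages, and the loading values. Setting the latent correlation to 1.0 gives the single-factor version.

```python
def implied_correlations(load_eng, load_other, phi, method_cov):
    """Model-implied correlations for a standardized two-factor model.

    load_eng, load_other: standardized loadings per language; matched
        tasks are assumed to come first and in the same order.
    phi: latent cross-language correlation (1.0 = single factor).
    method_cov: residual covariances for the matched task pairs.
    """
    loadings = list(load_eng) + list(load_other)
    n_eng = len(load_eng)
    k = len(loadings)
    implied = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                implied[i][j] = 1.0  # standardized indicators
                continue
            same_language = (i < n_eng) == (j < n_eng)
            value = loadings[i] * loadings[j] * (1.0 if same_language else phi)
            if not same_language:
                # Position of each indicator within its own language.
                e, o = (i, j - n_eng) if i < n_eng else (j, i - n_eng)
                if e == o and e < len(method_cov):
                    value += method_cov[e]  # same task in both languages
            implied[i][j] = value
    return implied


# Hypothetical 2 x 2 design (values chosen only for illustration).
matrix = implied_correlations([0.8, 0.7], [0.9, 0.6], phi=0.9,
                              method_cov=[0.10, 0.05])
for row in matrix:
    print([round(value, 3) for value in row])
```

The sketch shows why method covariances matter: two measures of the same task can correlate more highly across languages than the factors alone would imply.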


Results 


The two models were fit to each sample within each study (25 
samples from 22 studies), using the sample size reported by the 
authors (see Appendix B). The two-factor and single-factor tests of 
the research questions are each reported in turn. 


Two-Factor Models: Language-Specific Phonological 
Awareness (Question 1) 


Table 1 shows results for the two-factor models in the 25 
samples. The chi-square, comparative fit index (CFI), root-mean- 
square error of approximation (RMSEA, with 90% confidence 
interval), and standardized root mean residual (SRMR) are reported
for each study. Solely for visual reference, superscripts are
used to highlight model indices with a substantial lack of fit
(CFI < .90, RMSEA > .06, or SRMR > .08). We do not adhere
blindly to these model fit guidelines but discuss sources of misfit 
and interpretational problems (Marsh, Hau, & Grayson, 2005). 
Table 1 also contains the estimated latent cross-language correla- 
tion between factors of phonological awareness in English and the 
other language. 
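
The flagging rule used for the superscripts can be written as a short sketch. The function name is hypothetical, and the cutoffs are the ones given in the table notes (CFI < .90, RMSEA > .06, SRMR > .08); as stated above, they serve as a visual aid rather than a strict decision rule.

```python
def misfit_flags(cfi, rmsea, srmr):
    """Return the fit indices that fall outside the guideline cutoffs."""
    flags = []
    if cfi < 0.90:
        flags.append("CFI")
    if rmsea > 0.06:
        flags.append("RMSEA")
    if srmr > 0.08:
        flags.append("SRMR")
    return flags


# Two rows as reported in Table 1.
print(misfit_flags(cfi=1.00, rmsea=0.00, srmr=0.01))  # Kim (2009)
print(misfit_flags(cfi=0.88, rmsea=0.16, srmr=0.07))  # Yan et al. (2005)
```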

Of the 25 models, nine did not obtain interpretable results,
indicated by dashes in Table 1. Models with a lack of fit in CFI,
RMSEA, or SRMR are marked with a superscript.

Table 1
Fit Statistics for Two-Factor Models (Latent PA in Each Language)

Language and study (sample)                     Study notes  χ² (df)      p            CFI          RMSEA        90% CI       SRMR  Latent correlation
Greek
  Loizou & Stuart (2003; Greek)                 1            —            —            —            —            —            —     —
  Loizou & Stuart (2003; British)               2            —            —            —            —            —            —     —
Spanish
  Atwill et al. (2007)                          —            [illegible]  .10          .99          [illegible]  [.00, .40]   .02   0.81
  Atwill et al. (2010)                          —            3.8 (1)      .05          .98          .16ᵃ         [.00, .33]   .02   0.76
  Branum-Martin et al. (2006)                   3            11.9 (5)     .04          1.00         .04          [.01, .07]   .01   0.93
  Cisero & Royer (1995, Experiment 1)           —            3.8 (5)      .58          1.00         .00          [.00, .20]   .03   0.89
  Cisero & Royer (1995, Experiment 2, Time 1)   [illegible]  —            —            —            —            —            —     —
  Cisero & Royer (1995, Experiment 2, Time 2)   —            23.8 (5)     <.01         .93          .20ᵃ         [illegible]  .06   0.77
  Leafstedt & Gerber (2005)                     —            30.6 (15)    .01          .90          [illegible]  [.05, .16]   .08   0.98
  Gottardo & Mueller (2009)                     —            25.3 (12)    .01          .94          .10ᵃ         [.04, .15]   .06   0.66
  Bursztyn (1998)                               4            —            —            —            —            —            —     —
Korean
  Kim (2009)                                    —            0.4 (2)      .82          1.00         .00          [.00, .20]   .01   0.89
  Wang, Park, & Lee (2006)                      —            4.5 (5)      .48          1.00         .00          [.00, .20]   .04   0.94
  Cho & McBride-Chang (2005)                    5, 6         9.9 (7)      .19          .98          [illegible]  [.00, .16]   .04   0.93
Cantonese
  Luk (2003)                                    —            2.4 (6)      .88          1.00         .00          [.00, .11]   .03   0.94
  Gottardo et al. (2001)                        7            1.8 (3)      [illegible]  1.00         .00          [.00, .17]   .02   0.92
  Gottardo et al. (2006)                        1, 8         —            —            —            —            —            —     —
  McBride-Chang et al. (2006)                   [illegible]  —            —            —            —            —            —     —
  Luk & Bialystok (2008)                        1            —            —            —            —            —            —     —
Mandarin
  Wang, Cheng, & Chen (2006)                    —            0.3 (2)      .86          1.00         .00          [.00, .13]   .02   0.37
  Wang et al. (2005)                            4            —            —            —            —            —            —     —
  Wang et al. (2009)                            —            1.4 (2)      .49          1.00         .00          [.00, .20]   .03   0.06
  Yan et al. (2005)                             10           13.7 (5)     .02          .88ᵃ         .16          [.06, .27]   .07   0.55
  Xu & Dong (2005)                              —            18.9 (16)    [illegible]  [illegible]  .02          [.00, .06]   .02   0.94
  Tao et al. (2007)                             [illegible]  —            —            —            —            —            —     —

Note. Dashes indicate that the model did not yield interpretable results.
Study note refers to an explanatory description of the matrix or model
used, if needed: (1) The latent correlation estimated >1.0. (2) The model
would not converge, with some modifications suggesting that method
correlations were comparable to the English loadings. (3) The
student-level matrix from a two-level confirmatory factor model was
analyzed. (4) The cross-language and within-language correlations were
inconsistent such that no model would converge. (5) The reported
correlation matrix was residualized for age in the original article.
(6) One-year lag: Korean in kindergarten, English in Grade 1. (7) Sample
data consisted of weighted means and standard deviations from four age
groups. (8) The reported correlation matrix was residualized for age and
level of education in the original article. (9) Covariances were based on
weighted means and standard deviations from two age groups.
CFI = comparative fit index; RMSEA = root-mean-square error of
approximation, with 90% confidence interval (CI); SRMR = standardized
root mean residual. The latent correlation is the estimated correlation
between the English phonological awareness (PA) factor and the other
language PA factor.
ᵃ Indicates a lack of model fit (CFI < .90, RMSEA > .06, SRMR > .08).

In the two Greek samples (Loizou & Stuart, 2003), the two-factor model
would not converge, indicating that the reported correlations were
inconsistent with the theoretical restrictions implied by the model.
The British sample obtained estimates, but a negative residual
variance suggested the model was not correct (this study is reexamined
in the single-factor results).

Two of the Spanish studies failed to converge, due to heterogeneous
correlations inconsistent with the two-factor model (Bursztyn,
1998; Cisero & Royer, 1995, Experiment 2, Time 1). Four of the
seven estimable Spanish studies fit well as two-factor models, but the
three studies in the lower part of the Spanish section of Table 1 had
some evidence of misfit. The estimated latent correlations between
English and Spanish were high, ranging from 0.66 to 0.98.

All three two-factor models in Korean fit well, with latent
correlations ranging from 0.89 to 0.93 (see Table 1). These high
correlations were tested further in Research Question 2.

In Cantonese, only two of the five studies had admissible
two-factor models. Both of these models fit well, with latent
correlations of 0.92 and 0.94. The three studies with estimation
trouble had latent correlations greater than 1.0, suggesting that a
one-factor model could be more appropriate (see the next section).

In Mandarin, four of the six studies had admissible solutions,
with three of them fitting well. The latent correlations, however, were
heterogeneous, ranging from 0.06 to 0.94.


Tests of a Single Factor Across Language (Question 2) 


In order to test bilingual phonological awareness as a single 
factor across languages, we also fit a single-factor model to the 
studies listed in Table 1. Residual method correlations, as in Figure
A1 in the Appendix, were still allowed. Table 2 lists the fit
statistics for single-factor models for these studies (as in Table 1). 
In addition, the rightmost column of Table 2 reports the p value of 
the chi-square test of model restriction, comparing the single- 
factor model to the less restrictive two-factor model in Table 1 
(likelihood ratio test, marked with a superscript if p < .05). If the
p value of this chi-square test is above .05, the fit of the one-factor model is
acceptable, compared to the two-factor model. 
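
The comparison described above can be sketched as a chi-square difference test. This is an illustration using only the Python standard library, with hypothetical function names; it is not the Mplus procedure used for the reported results. The example uses the chi-square values reported for Cisero and Royer (1995, Experiment 1): 3.8 (5) for the two-factor model and 6.9 (6) for the single-factor model.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Closed form for odd df: normal tail plus a finite series.
    total = 0.0
    term = math.sqrt(2.0 * x / math.pi)
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_full, df_full):
    """Likelihood ratio test of a restricted model against a fuller model."""
    df_diff = df_restricted - df_full
    if df_diff <= 0:
        raise ValueError("models are equivalent or not nested; no test possible")
    diff = chi2_restricted - chi2_full
    return diff, df_diff, chi2_sf(diff, df_diff)


# Single-factor (restricted) versus two-factor (less restrictive) model.
diff, df_diff, p = chi2_difference_test(6.9, 6, 3.8, 5)
print(f"chi-square difference = {diff:.1f}, df = {df_diff}, p = {p:.2f}")
```

When the two models have the same number of parameters, the difference in degrees of freedom is zero and no test is possible, which corresponds to the entries marked equivalent in Table 2.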
Table 2
Fit Statistics for Single-Factor Models (PA Is Unitary and Language General)

Language and study (sample)                     χ² (df)              p            CFI     RMSEA   90% CI       SRMR         p (diff)
Greek
  Loizou & Stuart (2003; Greek)                 25.3 (16)            .06          0.88ᵃ   0.18ᵃ   [.00, .37]   0.07         n/a
  Loizou & Stuart (2003; British)               39.4 (16)            <.01         0.66    0.30ᵃ   [.18, .42]   [illegible]  n/a
Spanish
  Atwill et al. (2007)                          9.5 (2)              .01          0.94    0.24ᵃ   [.10, .39]   0.04         0.01ᵃ
  Atwill et al. (2010)                          17.2 (2)             <.01         0.91    0.26ᵃ   [illegible]  0.05         <.01ᵃ
  Branum-Martin et al. (2006)                   35.7 (6)             <.01         0.99    0.08ᵃ   [.06, .10]   0.02         <.01ᵃ
  Cisero & Royer (1995, Experiment 1)           6.9 (6)              [illegible]  0.99    0.06    [.00, .23]   0.04         0.08
  Cisero & Royer (1995, Experiment 2, Time 1)   155.2 (6)            <.01         0.68ᵃ   0.50ᵃ   [.43, .57]   0.05         n/a
  Cisero & Royer (1995, Experiment 2, Time 2)   40.9 (6)             <.01         0.88ᵃ   0.24ᵃ   [.18, .32]   0.05         [illegible]
  Leafstedt & Gerber (2005)                     30.7 (16)            .01          0.94    0.10ᵃ   [.04, .16]   0.08         0.75
  Gottardo & Mueller (2009)                     47.0 (13)            <.01         0.86    0.15ᵃ   [.11, .20]   0.08         <.01ᵃ
  Bursztyn (1998)                               —                    —            —       —       —            —            equivalent
Korean
  Kim (2009)                                    0.4 (2)              .82          1.00    0.00    [.00, .20]   0.01         equivalent
  Wang, Park, & Lee (2006)                      5.2 (6)              .52          1.00    0.00    [.00, .18]   0.03         0.40
  Cho & McBride-Chang (2005)                    10.5 (8)             .25          0.98    0.06    [.00, .14]   0.04         0.44
Cantonese
  Luk (2003)                                    2.9 (7)              .89          1.00    0.00    [.00, .10]   0.03         0.48
  Gottardo et al. (2001)                        2.3 (4)              .68          1.00    0.00    [.00, .14]   0.02         0.48
  Gottardo et al. (2006)                        0.1 (1)              .80          1.00    0.00    [.00, .27]   0.01         n/a
  McBride-Chang et al. (2006)                   171.6 ([illegible])  <.01         0.62ᵃ   0.62ᵃ   [illegible]  0.16         n/a
  Luk & Bialystok (2008)                        37.5 (31)            .20          0.94    0.06    [.00, .12]   0.08         n/a
Mandarin
  Wang, Cheng, & Chen (2006)                    0.3 (2)              .86          1.00    0.00    [.00, .13]   0.02         equivalent
  Wang et al. (2005)                            —                    —            —       —       —            —            equivalent
  Wang et al. (2009)                            1.4 (2)              [illegible]  1.00    0.00    [.00, .20]   0.03         equivalent
  Yan et al. (2005)                             20.9 (6)             <.01         0.79ᵃ   0.20ᵃ   [.11, .29]   0.08         0.01
  Xu & Dong (2005)                              24.2 (18)            .11          0.99    0.04    [.00, .07]   0.03         0.07
  Tao et al. (2007)                             18.4 (11)            .07          0.93    0.10ᵃ   [.00, .17]   0.06         n/a

Note. p(diff) is the p value for the chi-square test of model fit for
each study, compared to the respective two-factor model in Table 1. A
significant p value (e.g., < .05) in this goodness-of-fit test indicates
the restricted single-factor model is significantly worse than the
two-factor model. n/a = not applicable, because the two-factor model was
not interpretable. equivalent indicates that the single-factor model had
the same number of parameters as the two-factor model, and all fit
statistics are identical to the two-factor results.
ᵃ Indicates a lack of model fit (comparative fit index < .90,
root-mean-square error of approximation > .06, standardized root mean
residual > .08, p(diff) < .05).


Only two of the 25 single-factor models were not estimable in
Table 2 (Bursztyn, 1998; Wang, Perfetti, & Liu, 2005). A two-factor
model could not be fit to either of these studies (see Table 1).
Their covariance structures were heterogeneous, suggesting the
theories were not appropriate to these samples.

In Greek, the single-factor model did not fit well for either the 
Greek or the British subsample. Implications of this lack of fit will 
be discussed. 

In Spanish, three studies had overall reasonable fit for the
single-factor model (Branum-Martin et al., 2006; Experiment 1 in
Cisero & Royer, 1995; Leafstedt & Gerber, 2005). SRMR did not
indicate misfit for any study, but seven of the eight studies had
some degree of misfit in RMSEA and three also had a lack of fit
in CFI. The restriction to a single factor was acceptable for two of
the seven models in which a test was possible.

The single-factor model fit well for all three Korean studies. The 
restriction of two factors down to one factor fit for all three studies 
(see Table 2), suggesting that the one-factor model was adequate 
for all three studies of the Korean language. 

Four of the five Cantonese studies fit well as single-factor
models. The two studies that could be tested against the two-factor
model also passed the restriction to a single factor. These
single-factor Cantonese studies are examined for their substantive
interpretation in the next section.

In Mandarin, five of the six studies had interpretable results for 
the single-factor model. Four of the five models with results fit 
reasonably well. The estimates from these studies are examined 
next. 


Model Results (Question 3) 


Overall model fit statistics do not necessarily imply that the 
model is substantively meaningful. The resulting parameters 
should be interpreted with respect to their intended roles. In 
particular, the fully standardized factor loadings are validity coef- 
ficients, indexing the correlation between the measure and respec- 
tive factor: The higher the standardized loading, the better that 
indicator measures the intended factor (see Appendix A). 

The residual correlations indicate the extent to which there is 
excess shared method variance not explained by the phonological 
awareness factors in the model. These correlations reflect method 
effects, or relations due to similarities in the tests and other 
sources, rather than the factors. The best fitting model for each 
study is examined next. 
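
One way to read a validity coefficient, sketched here under the usual assumptions of a fully standardized factor model, is that the squared loading is the share of a measure's variance attributable to the factor, with the remainder left as residual (task-specific and error) variance. The function name is hypothetical; the three loadings are values reported in the results that follow.

```python
def variance_explained(loading):
    """Share of indicator variance explained by the factor (standardized)."""
    return loading ** 2


# Low, moderate, and high loadings reported in the results below.
for loading in (0.26, 0.48, 0.97):
    share = variance_explained(loading)
    print(f"loading {loading:.2f}: {share:.0%} explained, "
          f"{1 - share:.0%} residual")
```

Method correlations are computed on that residual part, which is why a measure with a low loading can still show a high method correlation with its counterpart in the other language.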
Figure 1 presents the single-factor models for the Greek and
British subsamples (Loizou & Stuart, 2003). Both models fit 
poorly (see Table 2). In the British subsample, the factor loadings 
for English rhyme oddity, onset oddity, and phoneme elision were 
low (0.26 to 0.48). The loadings for Greek rhyme oddity and 
phoneme oddity were also low (0.43 and 0.41). Method correla- 
tions were high for rhyme oddity (0.73) and phoneme oddity 
(0.63). Although the fit of the model for the Greek group was poor, 
the loadings were not disturbingly low. Method correlations for the 
Greek group shown in Figure 1 were high for phoneme oddity 
(0.68) and phoneme elision (0.54). 

Figure 1. Best fitting results for Greek studies (fully standardized).
Neither study in Greek had an estimable two-factor model. Neither study
had acceptable fit for one-factor models. Estimates are shown for their
validity coefficients and method correlations. PA = phonological
awareness; E = English; G = Greek.

Figure 2 shows the fully standardized results for the best fitting
models for Spanish studies. The first three two-factor models in the
left-hand column (Atwill, Blanchard, Gorin, & Burstein, 2007;
Atwill, Blanchard, Christie, Gorin, & Garcia, 2010; Branum-Martin
et al., 2006) all had good validity coefficients. The
Branum-Martin et al. (2006) study had low to moderate method
correlations (0.13 to 0.41). These three studies all had high latent
cross-language correlations. The other two studies that had
two-factor results (Experiment 2, Time 2, in Cisero & Royer, 1995;
Gottardo & Mueller, 2009) did not have good model fit (see Table
1). Their validity coefficients were moderate to high, with high
cross-language latent correlations (0.77 and 0.66).

The remaining three studies in Figure 2 show the Spanish 
studies for which the single-factor model fit best. Cisero and 
Royer’s (1995) Experiment 1 fit well, had good loadings, but had 
some sizable method correlations (0.44 for rhyme detection and 
0.32 for final phoneme detection). Cisero and Royer’s Experiment 
2, Time 1, did not fit well (see Table 2), but had reasonable 
loadings and one high method correlation (0.64 for rhyme detec- 
tion). The Leafstedt and Gerber (2005) study fit well but had low 
validity coefficients for blending and segmenting phonemes (0.30 
to 0.47) in each language. The Bursztyn (1998) study did not 
obtain results and is not shown. 

Figure 3 presents the estimated one-factor models for the three 
studies in Korean. The Korean studies showed moderate to high 
validity coefficients (0.46 to 0.97) and reasonably low method 
correlations (0.01 to 0.37). 

Figure 4 presents the best fitting models for the Chinese studies, 
with Cantonese on the left side and Mandarin on the right. The 
single-factor model fit reasonably well for all the Cantonese stud- 
ies except for one (McBride-Chang, Cheung, Chow, Chow, & 
Choi, 2006). Despite the reasonable fit for four of the five Can- 
tonese studies, the loadings were only moderate to high (0.10 to 
0.92, excluding tone measures) and were particularly low for 
measures involving tones (0.22 to 0.59). Method correlations were 
generally moderate. The 5 × 5 study by Luk and Bialystok (2008)
had particularly low loadings, even within languages, suggesting 
heterogeneous relations among the measures (and therefore a poor 
match to theory). 

The results for Mandarin on the right side of Figure 4 were even 
more heterogeneous than those for the models for the Cantonese 
studies. One study did not have dependable estimates (Wang et al., 
2005). Two studies did not fit well (Tao, Feng, & Li, 2007; Yan, 
Yu, & Zhang, 2005) and had only moderate to high loadings (0.36 
to 0.88), with low to moderate method correlations (-0.48 to
0.50). In the two 1 × 3 studies (Wang, Cheng, & Chen, 2006;
Wang et al., 2009), the single English measure of phoneme deletion
was not a strong indicator (0.37 and 0.06), suggesting that
despite model fit, the cross-language nature of the single factor
was not well identified. The largest (4 × 4) Mandarin study (Xu &
Dong, 2005) had good fit, good loadings (0.55 to 0.76), and low
method correlations (-0.03 to 0.07). Across all Mandarin studies,
measures involving tones had fairly uniform validity coefficients 
in a moderate range (0.39 to 0.48). 


Conclusions for Each Language 


To graphically summarize the findings and facilitate comparisons,
Figure 5 presents the latent variable model estimates of the
cross-language correlation along with estimates from the
meta-analysis by Branum-Martin et al. (2012). For each language,
Figure 5 lists the 25 samples from 22 studies on the left axis. At the
bottom of each language section, a square shows the model-based
estimate of the cross-language correlation (with 95% CI) from the
meta-analysis of all studies that used matched measures (see Table
1 from Branum-Martin et al., 2012). For each study in the present
analysis, a circle is shown for the estimated latent cross-language
correlation (if estimable) from the two-factor model. The circle is
filled if the two-factor model had acceptable fit and is empty if the
model had two or more indices of substantial misfit (see Table 1).
If the single-factor model was estimable, a triangle is shown at 1.0
(filled if the model had acceptable fit and empty if two or more fit
indices were outside acceptable range; see Table 2).

Figure 2. Best fitting results for Spanish studies (fully standardized).
Model results are shown for the best fitting two-factor versus
one-factor models. PA = phonological awareness; E = English;
S = Spanish; init. = identification; detect. = detection.

Figure 5 shows that, in Greek, neither sample had a dependable
two-factor model (no circles) and the one-factor models did not fit
well (empty triangles). The measures were not homogeneously related,
with the British group having low validity coefficients and both
groups having problematically high method correlations (see Figure
1). It remains to be seen how consistently Greek phonological
awareness tasks correlate in future samples.

In Spanish, the latent correlations shown by the circles in Figure 5
are substantially higher than the estimated meta-analysis correlations
for Spanish and English (square). These latent correlations are higher
because the measurement error unique to the particular tasks was
controlled, yielding disattenuated cross-language relations. These
models imply that Spanish and English phonological tasks function
relatively well as measures of a consistent construct in each language,
with correlations ranging from 0.66 to 0.98. Although the correlations
are not quite so high that phonological awareness is a single factor
across languages, five of the eight single-factor models had
reasonable fit, but only two studies passed the restriction to a
one-factor model (see Table 2).

The Korean studies shown in Figure 5 show uniformly high
correlations and acceptable one-factor solutions. The estimated
latent correlations from the two-factor models (black circles in 
Figure 5) are above the upper bound of the confidence interval for 
the four Korean studies from the bivariate meta-analysis (square), 
suggesting the correction for measurement error across multiple 
tests may be substantial. These high latent correlations and well- 
fitting single-factor models support the idea that phonological 
awareness tasks indicate a single factor across the English and 
Korean languages. 
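
The direction of this correction can be illustrated with the classical formula for disattenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. This is only an analogy: the latent correlations in this article come from multiple-indicator factor models, not from this formula, and the function name and numeric values below are hypothetical.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation due to measurement error."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical observed correlation and reliabilities (illustration only).
r_corrected = disattenuate(0.60, 0.80, 0.75)
print(f"observed r = 0.60, corrected r = {r_corrected:.2f}")
```

With perfectly reliable measures the correction leaves the correlation unchanged, and with unreliable measures the corrected value can exceed 1.0, which parallels the inadmissible latent correlations noted for some two-factor models.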
The five Cantonese studies in Figure 5 show that the one-factor
model is plausible in terms of fit. However, examination of the low 
validity coefficients and high method correlations in Figure 4 
suggests we should retain some skepticism regarding such a sim- 
plistic model of Cantonese and English, especially with respect to 
measures of tone awareness. 

The six Mandarin studies listed in Figure 5 highlight the wide, 
problematic variability in these studies (also seen in Figure 4). 
Except for the one large study by Xu and Dong (2005), the latent 
correlations (circles) were far from the perfect correlations (trian- 
gles), even when both models fit. Although two of these studies 
had only a single measure of English phonological awareness, the 
low loading for the English measure in these studies (see Figure 4) 
suggests low convergent validity for the English measure as an
indicator of a cross-language latent factor. This lack of convergent
validity may indicate limits of the measures or samples or that the
model is incorrect: Phonological awareness tasks may function
differently in Mandarin than they do in English, yielding two
language-specific factors (perhaps even poorly related). In general,
smaller models are less likely to be rejected and larger models are
more difficult to fit, so it remains to be seen to what extent these
mixed results in Mandarin represent issues of specific measures,
particular samples, or a problem for our theory of phonological
awareness as a coherent ability in Mandarin.


Discussion 


The meta-analysis by Branum-Martin et al. (2012) provided a high 
level of detail on specific design, language, and task effects. However, 
that analysis focused on only single cross-language correlations 
nested within each study for measures that were the same in each 
language. The current study tested the extent to which multiple 
measures were consistent with a language-specific theory of
phonological awareness (two factors) or a language-general theory of
phonological awareness (one factor, across languages), all while
examining method correlations across similar measures.

The current findings provide important clarification for the previ- 
ous findings for cross-language correlations of phonological aware- 
ness measures (Branum-Martin et al., 2012; Melby-Lervag & Lervag, 
2011). In particular, measurement error may have played a large role 
in lowering the correlations estimated in the meta-analyses. Three 
studies of Korean suggest that the Korean—English correlation among 
phonological measures adequately represents a single, cross-language 
factor. In Spanish, phonological tasks are highly related across lan- 
guage but might not represent a single ability. 

The findings in Cantonese suggest the possibility of a high
correlation with English for phonological tasks. However, the role 
of tone awareness is not clear. 

Finally, the nature of the cross-language relation of phonological
tasks in Greek and Mandarin is unclear, due to the mixed
results across studies. Most of these studies were small, meaning 
that the correlation matrices used were not necessarily dependable. 

In addition, the role of tone awareness in Mandarin is not clear,
but results suggest it is not a strong indicator of phonological
awareness (if it is an indicator at all). Tone awareness likely
involves other cognitive and
linguistic skills not strongly related to phonological awareness as 
measured by other tasks included in these studies (Chen et al., 2008). 


Considerations Concerning Chinese 


In Chinese, the factor loadings for syllable tasks were not 
always similar to the loadings for phoneme-level tasks, which may 
suggest some task-specific sources of variance related to psycho- 
linguistic grain size (Ziegler & Goswami, 2005, 2006). Children 
acquiring English need to employ both small and large grain-size 
strategies to read proficiently (Frost, 1998; Ziegler & Goswami, 
2006). Moreover, the Chinese writing system is even less phono- 
logically transparent than the English writing system. It is possible 
that children learning English along with Mandarin or Cantonese 
are less able to operate on small grain sizes than are monolingual 
English children, making phoneme-level tasks not only more dif- 
ficult but less related to their other phonological skills. Such 
task-specific differences have been noted in English studies (An- 
thony et al., 2002; Anthony, Lonigan, Driscoll, Phillips, & Bur- 
gess, 2003; Schatschneider, Francis, Foorman, Fletcher, & Mehta, 
1999) and Spanish studies (Anthony et al., 2011; Branum-Martin 
et al., 2006), especially for the more difficult task of segmenting
phonemes. Detailed studies for Chinese speakers are needed at
different ages and levels of exposure to English instruction to
examine how well phoneme-level tasks operate as indicators of
phonological awareness.

In addition, tone awareness might not be as closely aligned with
the latent factor as other phonological awareness tasks (Chen et al.,
2008). The extent to which grain size and tonal awareness constitute
distinguishable factors or are consistent but weak indicators of
a single latent ability remains to be seen in further discriminant
validity studies involving more tasks. Phoneme or tonal tasks in
Chinese may have task-specific variability but otherwise could still
be valid indicators of a single factor of phonological awareness (in
Chinese or across languages). From the results so far on just 11
studies, there is some support for phonological awareness as a
single factor within Chinese, but the role of tone awareness is
unclear and potentially different from other tasks involving
phonological awareness.

Figure 4. Best fitting results for Chinese studies (fully standardized;
Cantonese in left column; Mandarin in right column). Model results are
shown for the best fitting two-factor versus one-factor models. One
study (Wang et al., 2005) failed to estimate in either model and so is
shown with no estimates. PA = phonological awareness; E = English;
C = Cantonese; M = Mandarin; del. = deletion; count. = counting;
detect. = detection; phon. del. = phoneme deletion; phon. = phoneme;
recog. = recognition; ident. = identification.

Figure 5. Estimates of cross-language correlation for confirmatory factor
models, with estimates from prior meta-analysis. Squares represent the
model-based correlation for that language (with 95% confidence interval)
from the meta-analysis in Branum-Martin et al. (2012). Each circle
represents the estimated latent correlation for the two-factor model
for that study. Empty circles indicate the two-factor model had two or
more indices of model misfit. Triangles indicate a single-factor model
of phonological awareness was estimable for that study (a plausibly
perfect correlation), with empty triangles indicating substantial
misfit. Studies without a circle or triangle could not be fit with a
two- or one-factor model, respectively; exp = experiment.


Limitations 


First, instruction, age, and second versus first language may each be
important in activating languages in a bilingual person (Grosjean
& Li, 2013), but they were not measured here. Melby-Lervag and 
Lervag (2011) found no effects of instruction upon phonological 
awareness. Branum-Martin et al. (2012) found cross-language 
correlations to be slightly lower in older samples. Cross-language 
effects may differ over time, and they likely influence the attrition 
and not just the learning of a language (Li, 2013). These compli- 
cating factors of development and instruction over time may be 
responsible for the lack of fit in some samples in the current study. 
Alternatively, these factors may be responsible for spurious fit of 
other models in the current sample of studies. Given the difficulty 
in estimating the effects of instruction and age in prior analyses 
(Branum-Martin et al., 2012; Melby-Lervag & Lervag, 2011), 
differences due to time, development, and instruction will require 
a larger base of studies to be evaluated adequately. 

Second, this study used English as the anchor language. Other 
bilingual studies may be similarly revealing, such as between 
Cantonese and Mandarin (Chen et al., 2008) and between Turkish 
and Dutch (Verhoeven, 2007). If phonological awareness is indeed 
universal, we would expect the current results to be clarified in 
bilingual children in many pairings of languages. Experiments 
with multilingual children who speak three or more languages can 
easily be examined in the current framework, either from person- 
level data or from summary statistics. 

Third, because of the complexity of fitting a structural equation 
model (SEM) to each study, the current study has not explicitly 
modeled across-study variation (e.g., full meta-SEM; Cheung & 
Chan, 2005, 2009). More studies might be required, but future 
examinations could model an across-language latent correlation 
across studies. Similarly, method correlations in various studies, 
such as a consistent effect of phoneme-level tasks, could also be 
directly modeled. Such models could be fit in meta-SEM (Cheung 
& Chan, 2005, 2009). 

Last, the current examination is at the level of summary statistics 
and only for studies that used bilingual children. Many studies used 
small samples and were not designed for factor models. Studies 
within languages at the item level and using detailed linguistic infor- 
mation regarding item features are required to validate these findings 
(e.g., Anthony et al., 2011). Although interesting and provocative 
models may be fit to summary statistics, fundamental experiments are 
needed to validate these models as representative of the underlying 
cognitive processes. Carefully designed cognitive experiments (e.g., 
Perfetti et al., 2005; Perfetti & Zhang, 1995) as well as computational 
linguistic models may be helpful (see reviews by Grosjean & Li, 
2013; Thomas & van Heuven, 2005). 


Moving Forward in Bilingual Research on 
Phonological Awareness 


The current analyses suggest some changes in thinking about 
bilingual research, at least for children who may speak more than 
one language (see Grosjean & Li, 2013; Ziegler & Goswami, 
2006). First, researchers should report full descriptive statistics in 
their studies to allow other researchers to evaluate and reexamine 
their work (Zientek & Thompson, 2009). Many studies on reading 
in bilingual samples that used phonological awareness measures 
simply did not report cross-language correlations (Branum-Martin 
et al., 2012). Although several of the studies reported here were 
primarily concerned with the prediction of reading outcomes, their 
published correlations allowed us to examine structural questions 
among the phonological awareness predictors. Second, because 
researchers design measures to converge or discriminate on certain 
theoretical constructs in the presence of measurement error, latent 
variable approaches may help to develop better estimates of these 
constructs and their relations. Third, only one of the cited studies 
explicitly modeled classroom level differences. Frequently, stu- 
dents are assigned to classrooms at least partially on the basis of 
their linguistic abilities, so classroom-level differences are likely to 
be crucial in bilingual settings (Branum-Martin, Foorman, Francis, 
& Mehta, 2010; Branum-Martin et al., 2006, 2009). Moreover, 
classrooms likely differ in their amounts and methods of instruc- 
tion. We therefore wish to amplify the call for more multivariate, 
multilevel, and longitudinal examinations of bilingual phenomena 
(Genesee, Geva, Dressler, & Kamil, 2006). 

The current study highlights some crucial questions for future 
bilingual research in phonological awareness. First, to what extent are 
various phonological tasks in languages other than English indicative 
of a single underlying ability? Second, how highly related are pho- 
nological abilities across languages, or to what extent can we measure 
phonological awareness as a general human ability? Third, what are 
the limiting or facilitating roles of cognitive development, instruction, 
and the writing system of that language for the nature of phonological 
awareness and its role in reading? 


Conclusion 


The current findings give intriguing, if inconsistent, empirical 
support for the idea of phonological awareness being language 
general. This study extends the findings of the National Literacy 
Panel on Minority Language Children and Youth (August & 
Shanahan, 2006) regarding the importance of foundational skills 
and their cross-language effects. The questions of what kind and 
how much native language versus target language instruction best 
facilitates literacy acquisition (and for what kinds of children) will 
be important to pursue in future research. The current findings 
provide a basis for approaching such questions by providing a 
basic framework for conceptualizing the role of phonological 
awareness across languages. Important extensions will include 
examining the effects of different instructional models (e.g., tran- 
sitional, immersion, and dual-language) upon these cross-language 
effects (Branum-Martin et al., 2012; Melby-Lervag & Lervag, 
2011). We hope that the current work sparks productive future 
investigations in this area, leading to improved understanding and 
delivery of bilingual education. 
Appendix A 


Example of a Confirmatory Factor Model 


To illustrate how multiple tasks can be used to test hypotheses 
of measuring trait versus method effects, we present a conceptual 
example with a specified model. Consider an example study in 
which three tasks are given to children in English (e.g., blending, 
segmenting, and elision) and three tasks are given in another 
language (e.g., blending, segmenting, and sound matching). 

Figure A1 presents a structural equation model diagram of a 
confirmatory factor model for these three tasks in two languages 
with fully standardized results. Circles represent latent factors: the 
number and type of constructs (traits) we wish to test. Rectangles 
represent observed test scores. Straight arrows represent measure- 
ment relations (pattern coefficients or loadings). Curved, double- 
headed arrows represent correlations. Gray curved, double-headed 
arrows represent variance (standardized to 1.0). 

Figure A1. Specification of a cross-language model of phonological 
awareness (fully standardized). Mean structure not shown. PA = phonological awareness. 

Overall, this model implies a particular pattern of correlations 
among variables: Correlations are caused by the proposed factors, 
with method correlations, each of which can be evaluated. Where 
measures are the same in both languages, a residual method 
correlation can be included. If measures are not the same across 
languages, the factor model implies that correlations among tests 
are caused only by the hypothesized factors. This sort of model has 
four important characteristics for questions of cross-language pho- 
nological awareness: 


1. Overall fit of the model represents the extent to which 
the data fit this theoretically specified model. That is, 
indices of model fit suggest how closely the covariance 
structure implied by the theoretical model matches the 
actual covariance reported for the study. Good model fit 
supports validity of the theoretically specified con- 
structs, and rejection of the model suggests that our 
theory for the number and pattern of constructs is not 
correct (Borsboom, 2005; Marsh et al., 2005; Marsh, 
Hau, & Wen, 2004). 


2. Factor loadings (λ) represent the sensitivity or quality of 
that test for measuring the proposed latent factor (Bollen, 
1989; Gustafsson & Åberg-Bengtsson, 2010; McDonald, 
1999). The squared loadings from the fully 
standardized solution represent the proportion of variance 
in the test explained by the latent factor (R²). 


3. The cross-language correlation between factors (curved, 
double-headed arrow on the left side) represents the 
relation between latent constructs, corrected for mea- 
surement error. If this correlation is high, it could sug- 
gest a more parsimonious model of only one latent 
factor across languages. A model of a single, cross- 
language factor can also be tested. 




4. The method correlations (right side) represent residual 
relations due to that specific method (e.g., blending) 
that is not predicted by the common factor of 
phonological awareness in that language. The extent 
to which these method correlations are higher than 
the loadings or the latent correlation suggests that 
method effects may be more important than trait 
effects. 
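
The four characteristics above can be made concrete with a small numerical sketch. The Python code below is an illustration only: the loadings, the cross-language factor correlation, and the method correlations are hypothetical values chosen for the example, not estimates from any study in this article. It assumes the standard confirmatory factor decomposition (implied correlations = loadings × factor correlations × loadings transposed, plus residuals) and a fully standardized solution.

```python
import numpy as np

# Hypothetical standardized loadings (illustrative values, not study estimates).
# Rows 0-2: English blending, segmenting, elision.
# Rows 3-5: other-language blending, segmenting, sound matching.
# Columns: English PA factor, other-language PA factor.
loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.8],
    [0.0, 0.7],
    [0.0, 0.6],
])

# Hypothetical cross-language correlation between the two latent factors.
factor_corr = np.array([[1.0, 0.5],
                        [0.5, 1.0]])

# Hypothetical residual (method) correlations for tasks matched across
# languages: blending with blending, segmenting with segmenting.
method_corr = {(0, 3): 0.2, (1, 4): 0.1}


def implied_correlations(loadings, factor_corr, method_corr):
    """Correlation matrix implied by a fully standardized factor model."""
    common = loadings @ factor_corr @ loadings.T
    # In a standardized solution each test has unit variance, so the
    # residual variance is whatever the factors leave unexplained.
    residual_var = 1.0 - np.diag(common)
    theta = np.diag(residual_var)
    for (i, j), r in method_corr.items():
        # Convert the residual correlation to a residual covariance.
        theta[i, j] = theta[j, i] = r * np.sqrt(residual_var[i] * residual_var[j])
    return common + theta


sigma = implied_correlations(loadings, factor_corr, method_corr)

# Squared standardized loadings: proportion of test variance explained (R^2).
r_squared = (loadings ** 2).sum(axis=1)

print(np.round(sigma, 3))
print(np.round(r_squared, 2))
```

In this sketch, unmatched tasks across languages (e.g., elision and sound matching) correlate only through the factors, whereas matched tasks receive an additional residual contribution. Comparing such an implied matrix with the published correlations is what the fit indices in point 1 summarize.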


Appendix B 

Table of Study Characteristics 

| Study and language | Measures | Sample | English | Other |
| --- | --- | --- | --- | --- |
| Greek | | | | |
| Loizou & Stuart (2003) | 6 × 6 | n = 18 in Greece; 16 in UK; pre-kindergarten | rhyme oddity, syllable completion, onset oddity, initial phoneme identification, single phoneme onset oddity, phoneme elision | rhyme oddity, syllable completion, onset oddity, initial phoneme identification, single phoneme onset oddity, phoneme elision |
| Spanish | | | | |
| Atwill et al. (2007) | 2 × 2 | n = 68; kindergarten, southwestern US | initial sound fluency, phoneme segmentation fluency | initial sound fluency, phoneme segmentation fluency |
| Atwill et al. (2010) | 2 × 2 | n = 68; kindergarten, southwestern US | initial sound fluency, phoneme segmentation fluency | initial sound fluency, phoneme segmentation fluency |
| Branum-Martin et al. (2006) | 3 × 3 | n = 812; kindergarten, southwestern US | blending nonwords, segmenting words, phoneme elision | blending nonwords, segmenting words, phoneme elision |
| Cisero & Royer (1995) | 3 × 3 | n = 36-99; kindergarten to first grade, northeastern US | rhyme detection, initial phoneme detection, final phoneme detection | rhyme detection, initial phoneme detection, final phoneme detection |
| Leafstedt & Gerber (2005) | 4 × 4 | n = 89; first grade, western US | onset, rime, blending, segmenting | onset, rime, blending, segmenting |
| Gottardo & Mueller (2009) | 3 × 4 | n = 114; Grades 1-2, Canada | phoneme detection, phoneme elision, blending nonwords | phoneme detection, rhyme identification, initial phoneme matching, final phoneme matching |
| Bursztyn (1998)* | 1 × 3 | n = 45; Grades 1-2, northeastern US | quick rhyming | segmenting, blending, initial phoneme matching |
| Korean | | | | |
| Kim (2009) | 1 × 3 | n = 33; kindergarten, eastern and western US | total of blending, matching, segmenting | blending and segmenting: rimes, body-coda, phonemes |
| Wang, Park, & Lee (2006) | 3 × 3 | n = 45; Grades 1-3, eastern US | onset detection, rhyme detection, phoneme deletion | onset detection, rhyme detection, phoneme deletion |
| Cho & McBride-Chang (2005) | 3 × 3 | n = 91; Grade 2, Korea | syllable deletion, coda deletion, phoneme deletion | syllable deletion, onset phoneme deletion, coda phoneme deletion |
| Cantonese | | | | |
| Luk (2003) | 3 × 3 | n = 33; Grade 1, Canada | syllable deletion, phoneme onset deletion, phoneme counting | syllable deletion, phoneme onset deletion, tonal awareness |
| Gottardo et al. (2001) | 3 × 2 | n = 65; Grades 1-8, Canada | rhyme detection, phoneme detection, phoneme deletion | rhyme detection, tone detection |
| Gottardo et al. (2006) | 2 × 2 | n = 40; Grades 1-8, Canada | rhyme detection, phoneme deletion | rhyme detection, tone detection |
| McBride-Chang et al. (2006) | 2 × 2 | n = 217; kindergarten, Hong Kong | syllable deletion, phoneme onset deletion | syllable deletion, phoneme onset deletion |
| Luk & Bialystok (2008) | 3 × 5 | n = 57; Grade 1, Canada | onset syllable deletion, medial syllable deletion, final syllable deletion, onset phoneme deletion, phoneme counting | onset syllable deletion, medial syllable deletion, final syllable deletion, onset phoneme deletion, tone awareness |
| Mandarin | | | | |
| Wang, Cheng, & Chen (2006)* | 1 × 3 | n = 64; Grades 1-5, eastern US | phoneme deletion | onset awareness, rime awareness, tone awareness |
| Wang et al. (2005) | 3 × 3 | n = 46; Grades 2-3, eastern US | onset matching, rime matching, phoneme deletion | onset matching, rime matching, tone matching |
| Wang et al. (2009)* | 1 × 3 | n = 78; Grade 1, eastern US | phoneme deletion | onset awareness, rime awareness, tone awareness |
| Yan et al. (2005) | 3 × 3 | n = 64; kindergarten, China (Beijing) | syllable recognition, rhyme, phoneme identification | syllable recognition, rhyme, phoneme identification |
| Xu & Dong (2005) | 4 × 4 | n = 302; Grades 1, 3, 5, Beijing | rime oddity, onset oddity, final phoneme oddity, phoneme counting | rime oddity, onset oddity, phoneme oddity, tone oddity |
| Tao et al. (2007) | 3 × 4 | n = 74; Grades 3 and 5, Beijing | syllable detection and deletion, onset-rime detection and deletion, phoneme detection and deletion | syllable detection and deletion, onset-rime detection and deletion, phoneme detection and deletion, tone detection and substitution |

Note. An asterisk indicates a study excluded from the previous meta-analysis (Branum-Martin et al., 2012) for not having matched measures across 
language. Italics indicate a measure that is matched across languages. The Measures column shows the number of measures in English first and the other 
language second, so that 1 × 3 indicates that correlations were given for 1 measure in English and 3 measures in the other language. UK = United Kingdom; 
US = United States. 
Literacy Skill Development of Children With Familial Risk for Dyslexia 
Through Grades 2, 3, and 8 


Kenneth Eklund, Minna Torppa, Mikko Aro, Paavo H. T. Leppänen, and Heikki Lyytinen 
University of Jyväskylä 


This study followed the development of reading speed, reading accuracy, and spelling in transparent 
Finnish orthography in children through Grades 2, 3, and 8. We compared 2 groups of children with 
familial risk for dyslexia—1 group with dyslexia (Dys_FR, n = 35) and 1 group without (NoDys_FR, 
n = 66) in Grade 2—with a group of children without familial risk for dyslexia (controls, n = 72). The 
Dys_FR group showed persistent deficiency, especially in reading speed, and, to a minor extent, in 
reading and spelling accuracy. The Dys_FR children, contrary to the other 2 groups, relied heavily on 
letter-by-letter decoding in Grades 2 and 3. In children not fulfilling the criteria for dyslexia in Grade 2, 
the familial risk did not substantially affect the subsequent development of literacy skills. 


Keywords: reading development, spelling development, familial risk, dyslexia, longitudinal 


Literacy skills are a key to educational and occupational success 
in most societies. For a considerable proportion of the population, 
difficulties in reading and spelling development make them vul- 
nerable to underachievement throughout their school years and 
even beyond (Snowling, Adams, Bishop, & Stothard, 2001). Chil- 
dren with a family history of dyslexia represent a substantial part 
of this population: 34%-66% of children born to families with 
dyslexia have been reported to have severe difficulties in reading 
and spelling acquisition during the first grades at school (Penning- 
ton & Lefly, 2001; Puolakanaho et al., 2007; Scarborough, 1990; 
Snowling, Gallagher, & Frith, 2003). The majority of studies of 
reading development have focused on reading accuracy, and less is 
known about the development of reading speed (Landerl & Wim- 
mer, 2008; Share, 2008) and spelling (Lervag & Hulme, 2010). In 
studies of reading speed, a few longitudinal follow-ups have 
spanned beyond Grade 3 (de Jong & van der Leij, 2003; Landerl 
& Wimmer, 2008; Parrila, Aunola, Leskinen, Nurmi, & Kirby, 
2005), but follow-ups at school age with samples including chil- 
dren with familial risk for dyslexia are scarce (see, however, 
Snowling, Muter, & Carroll, 2007; van Bergen et al., 2011). This 
longitudinal study examines reading and spelling development 
across Grades 2, 3, and 8 in three groups: children with familial 
risk for dyslexia and dyslexia in Grade 2, children with familial 
risk but no dyslexia in Grade 2, and children without a familial risk 
and without dyslexia. We had three aims: (a) to study the stability 
of reading and spelling skills beyond the literacy acquisition phase, 
(b) to examine the effect of familial risk on reading and spelling 
development, and (c) to examine the effect of reading task and 
material (word list, text, and pseudoword text) on reading speed in 
different groups at different ages. 


Stability in Reading Speed, Reading Accuracy, 
and Spelling 


Only a few studies have described reading and spelling devel- 
opment from childhood to adolescence in a longitudinal design, 
and most of them have involved English-speaking children and 
have focused on the development of reading accuracy (Francis, 
Shaywitz, Stuebing, Shaywitz, & Fletcher, 1996; Parrila et al., 
2005; Shaywitz et al., 1995). During recent years, reading speed 
and fluency (speed adjusted for accuracy) have begun to receive 
more attention in developmental reading research. In one of the 
few studies focusing on the development of reading speed, Landerl 
and Wimmer (2008) reported high stability and steady growth in a 
sample of German-speaking (Austrian) children in Grades 1, 4, 
and 8. Correlations between reading speed measures at different 
grade levels varied from .59 to .81, indicating high stability, which 
was confirmed at the individual level: eight out of 11 slow readers 
in Grade 1 were still at least one standard deviation below the 
sample average in Grade 8. Similarly, high correlations were 
reported in reading speed between words (.69) and nonwords (.66) 
in a shorter Dutch follow-up ranging from Grades 1 to 3 (de Jong 
& van der Leij, 2002) as well as in English between Grades 1 and 
2 in word list (.79) and oral text reading (.82) fluency (Kim, 
Wagner, & Lopez, 2012). In Finnish, correlations between Grade 
1 (fall) and Grade 2 (spring) have varied from .59 in text reading 
fluency (Parrila et al., 2005) to .67 in word recognition fluency 
(Torppa et al., 2007). 
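
The two notions of stability used in these studies, correlational stability across grades and persistence of individual slow readers, can be sketched in a few lines of Python. The scores below are hypothetical and serve only to show the computation; the one-standard-deviation cutoff mirrors the criterion described for Landerl and Wimmer (2008), and the function name `stability` is an assumption of this sketch.

```python
import numpy as np


def stability(early, late, cutoff_sd=-1.0):
    """Return (Pearson r, number slow at the early grade, number still slow later).

    Scores are assumed to be reading-speed scores where higher means faster.
    A reader counts as slow when the z score is at or below cutoff_sd.
    """
    early = np.asarray(early, dtype=float)
    late = np.asarray(late, dtype=float)
    r = np.corrcoef(early, late)[0, 1]
    z_early = (early - early.mean()) / early.std(ddof=1)
    z_late = (late - late.mean()) / late.std(ddof=1)
    slow_early = z_early <= cutoff_sd
    still_slow = slow_early & (z_late <= cutoff_sd)
    return r, int(slow_early.sum()), int(still_slow.sum())


# Hypothetical reading-speed scores for six children at two grade levels.
grade1 = [10, 20, 30, 40, 50, 60]
grade8 = [15, 30, 33, 45, 52, 65]

r, n_slow, n_still_slow = stability(grade1, grade8)
print(f"r = {r:.2f}; {n_still_slow} of {n_slow} slow readers remained slow")
```

A high correlation and a high proportion of persistently slow readers need not coincide, which is why studies of this kind tend to report both.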

The stability of reading accuracy has also been reported to be 
high. In an English-speaking Canadian sample, the across-grade 
correlations varied between .47 and .94 in the yearly assessments 
from Grades 1 to 5 (Parrila et al., 2005). In transparent orthogra- 
phies, the development of reading accuracy is very different from 
English, because the acquisition of reading accuracy in transparent 
orthographies is fast. In a cross-language comparison of seven 
languages, Aro and Wimmer (2003) reported that the percentage of 
accurately read pseudowords approached 90% at the end of Grade 
1 in all six orthographies (German, Dutch, Swedish, French, Span- 
ish, and Finnish) other than English. Even children with dyslexia 
have been reported to read at least words with high accuracy after 
Grade 1: the average accuracy percentage was 91% in a Dutch 
sample of children with dyslexia (de Jong & van der Leij, 2003). 
Therefore, reading accuracy is seldom followed up and reported on 
in transparent orthographies after Grade 1. U. Leppänen, Niemi, 
Aunola, and Nurmi (2006) have, however, reported moderate to 
high correlations, ranging from .52 to .91, in reading accuracy of 
words and sentences in a Finnish sample in four assessments 
during Grades 1 and 2. 

As noted in various definitions of dyslexia, including the one 
from the International Dyslexia Association, problems in spelling 
are one key marker of dyslexia: “Dyslexia is a specific learning 
disability . . . characterized by difficulties with accurate and/or 
fluent word recognition and by poor spelling and decoding abili- 
ties” (p. 2; Lyon, Shaywitz, & Shaywitz, 2003). Spelling devel- 
opment, however, has attracted less attention (Caravolas, Hulme, 
& Snowling, 2001; Lervag & Hulme, 2010). There are studies that 
have examined the early prerequisites and predictors of spelling 
skill during the early grades of school in different orthographies 
(e.g., Furnes & Samuelsson, 2010; Kim & Petscher, 2011; U. 
Leppänen et al., 2006; Torppa et al., 2013; Wimmer & Mayringer, 
2002). But there are only a few longitudinal follow-ups that have 
examined the stability of spelling skill beyond the first grades at 
school in children without dyslexia (Abbott, Berninger, & Fayol, 
2010; Landerl & Wimmer, 2008; Lervag & Hulme, 2010), and 
with dyslexia (Shaywitz et al., 1999; Snowling et al., 2007). 
Several studies have shown that children with reading difficulties 
are often poor in both reading and spelling (de Jong & van der Leij, 
2003; Pennala et al., 2010; Pennington & Lefly, 2001; Puolaka- 
naho et al., 2008; van Bergen et al., 2012). In addition, the finding 
that spelling training in children with dyslexia enhances reading 
skills supports the idea of a close relationship between reading and 
spelling (Ise & Schulte-Körne, 2010). However, dissociation between 
spelling and reading has also been reported (Fayol, Zorman, 
& Lété, 2009; Moll & Landerl, 2009; Wimmer & Mayringer, 
2002). Reported correlations between two assessments of spelling 
have indicated moderate to high stability in English (.62-.92, in 
Grades 1-7; Abbott et al., 2010), in Norwegian (.47-.78, Grades 
1-3; Lervag & Hulme, 2010), and in German (.44-.77; Landerl & 
Wimmer, 2008). A tendency for stronger correlations between 
words compared with pseudowords (.67-.78 vs. .47-.59, respectively; 
Lervag & Hulme, 2010) as well as later versus earlier 
grades (.44-.47 in Grades 1-4 vs. .77 for Grades 4 and 8; Landerl 
& Wimmer, 2008) has also been reported. Correlational stability 
was not reported in the studies with a sample of children with 
dyslexia, but stable group differences between children with and 
without dyslexia were found between Grades 6 and 9 (ages 9-14 
years; Shaywitz et al., 1995) and between 8 and 12 years of age 
(Snowling et al., 2007). Our study is, to our knowledge, the first to 
examine the stability in reading speed and accuracy, as well as in 
spelling, in a sample of children with and without familial risk for 
dyslexia across a long time period from Grade 2 to Grade 8 (ages 
8-14 years). 


Familial Risk as a Continuum 


Several candidate susceptibility genes have been found to be 
linked to developmental dyslexia (Galaburda, LoTurco, Ramus, 
Fitch, & Rosen, 2006; Giraud & Ramus, 2012; Scerri & Schulte-Körne, 
2010), and the idea of multiple risk factors, some of which 
are transmitted also to offspring without dyslexia, is widely ac- 
cepted (Bishop, 2009; Pennington, 2006; Pennington & Lefly, 
2001; Pennington et al., 2012; Snowling, 2008; Snowling et al., 
2003). Pennington (2006) has suggested that multiple risk factors 
both in the genome and environment lead to a continuum of 
vulnerability instead of a dichotomous distribution of risk. At the 
behavioral level, this suggestion has been tested by comparing the 
performance of children with familial risk, either with or without 
dyslexia, and controls. If the familial risk is continuous, the group 
of children with familial risk but no dyslexia also should show 
lower performance in the underlying cognitive skills (endopheno- 
types) compared with controls. The studies comparing these three 
groups have mainly shown that children with familial risk who do 
not fulfill the criteria of dyslexia perform significantly below the 
level of the controls in certain language and literacy skills both 
prior to and after school entry (Boets et al., 2010; Gallagher, Frith, 
& Snowling, 2000; Pennington & Lefly, 2001; Snowling, 2008; 
Snowling et al., 2003; van Bergen et al., 2011, 2012). 

In English-speaking children, Pennington and Lefly (2001) 
found that the scores of children with familial risk but without 
dyslexia in Grade 2 were significantly lower—on average, 0.5 of 
a standard deviation—than the scores of children with no familial 
risk and no dyslexia in all except one reading task. In line with this 
result, Snowling et al. (2003) found that the at-risk children with- 
out dyslexia showed poor performance in nonword reading and 
phonetic spelling at the age of 6 years and poor skills in spelling, 
nonword reading accuracy, and reading comprehension at the age 
of 8 years. In addition, in a follow-up study in adolescence Snowl- 
ing et al. (2007) reported that the at-risk unimpaired children had 
weaker performance than controls in exception word reading, text 
reading accuracy, and all timed reading tasks. However, the at-risk 
unimpaired children did not show deficient performance in word 
reading accuracy at 8 years (Snowling et al., 2003), either in 
untimed nonword reading accuracy or in reading comprehension in 
adolescence (Snowling et al., 2007). The classification of children 
with dyslexia in these studies was based on a composite score 
including word reading and spelling accuracy as well as reading 
comprehension (Snowling et al., 2003, 2007). 

Van Bergen et al. (2011, 2012) have also found evidence for the 
continuity of genetic liability of dyslexia in Dutch samples of 
children. At the end of Grade 2, children with familial risk but no 
dyslexia scored higher than children with familial risk and dyslexia 
but were impaired, compared with controls, in all literacy mea- 
sures (i.e., reading accuracy and fluency), and spelling (van Ber- 
gen et al., 2012). It is noteworthy that the differences between the 
groups were similar irrespective of whether the items were words 
or nonwords. In another Dutch sample (van Bergen et al., 2011), 
where dyslexia was diagnosed in Grade 5, the at-risk nondyslexic 
children performed worse in nonword reading fluency in Grades 1, 
2, and 5 than typically reading control children. However, in word 
reading fluency, the groups did not differ anymore in Grade 5. The 
classification of children with dyslexia was based solely on read- 
ing fluency (van Bergen et al., 2011, 2012). 
On the other hand, the Dutch-speaking sample in Boets et al. 
(2010) showed support for the continuous nature of the effects of 
familial risk in prereading skills before school age only, but not in 
literacy skills at school age. Boets et al. (2010) found that the non- 
dyslexic at-risk children were poorer than control children in nonword 
repetition at kindergarten but not in Grades 1 and 3. They also found 
that this group was as good as the control group in word reading 
accuracy and speed as well as in nonword reading speed in Grades 1 
and 3. The only significant differences found between these two 
groups at school age were in nonword reading accuracy and spelling, 
both of which emphasize accurate decoding ability (Boets et al., 
2010). In a Finnish sample, no significant differences were found 
between children with familial risk but without dyslexia and typical 
readers from control families in reading-related prereading skills, 
including language skills and phonological sensitivity at age 1 year 6 
months through age 5 years 6 months and rapid serial naming and 
letter knowledge at age 3 years 6 months through age 5 years 6 
months (Torppa, Lyytinen, Erskine, Eklund, & Lyytinen, 2010). In 
Grade 2, the same groups did not differ from each other in reading 
accuracy or speed or in spelling, irrespective of whether the material 
was individually presented words or nonwords, or presented in the 
form of a list or text (Torppa et al., 2010). However, differences 
among the same three groups were found in brain responses to 
nonspeech pitch change in sounds at birth (P. H. T. Leppänen et al., 
2010) as well as in the ability to discriminate speech stimuli with a 
barely perceivable difference in Grade 2 and in Grade 3 (Pennala et 
al., 2010). In the sample, the classification of dyslexia was based on 
reading speed and accuracy as well as on spelling accuracy (Pennala 
et al., 2010; Torppa et al., 2010). 

Because several factors vary between these studies (e.g., lan- 
guage and orthography, age and way of classifying dyslexia, 
stimuli and tasks used), it is difficult to draw firm conclusions about 
the reasons for differing findings. It seems, however, that differ- 
ences among the groups without reading difficulty and with or 
without familial risk are more clearly present early in the devel- 
opment of skills (Boets et al., 2010; Snowling et al., 2007; van 
Bergen et al., 2010). In addition, typical readers with or without 
familial risk have performed at the same level in tasks such as 
word reading and reading comprehension, where it is possible to 
make use of abilities other than phonological-decoding-related 
skills (i.e., semantic and syntactic skills and contextual cues) to 
facilitate reading (Boets et al., 2010; Snowling, 2007; Snowling et 
al., 2008; van Bergen et al., 2011) or when there is less pressure, 
such as no time limit (Snowling et al., 2007). Based on the 
previous findings from Grade 2 in the Finnish sample (Torppa et 
al., 2010) and the fact that group differences tend to diminish along 
with age (Boets et al., 2010; Snowling et al., 2007; Torppa et al., 
2010; van Bergen et al., 2011), we expected that children with 
familial risk but no reading difficulty in Grade 2 would not differ 
from the control children in any of the reading and spelling 
measures in Grades 3 and 8. 


The Effect of Task on Reading Speed 


Differences in reading speed across tasks have been interpreted 
to reflect different processes involved in different reading tasks. 
Reading pseudowords has generally been considered as a good 
measure of decoding ability because it requires grapheme-to-phoneme 
decoding (e.g., Coltheart, Rastle, Perry, Langdon, & 
Ziegler, 2001). Children with reading difficulty have been shown 
to have serious deficiency with this type of decoding, at least in 
opaque orthographies (Bergmann & Wimmer, 2008; Ziegler, 
Perry, Ma-Wyatt, Ladner, & Schulte-Körne, 2003). In word reading, 
whether presented in a list or text format, the use of lexicon 
(i.e., activation of lexical representations) can substantially 
quicken reading speed by enabling fast whole-word recognition 
(Coltheart et al., 2001; Frith, Wimmer, & Landerl, 1998). 

Reading time of dyslexic readers has been shown to be more 
dependent on word length both in pseudoword and word reading 
than in control children (Ziegler et al., 2003; Zoccolotti et al., 
2005). These findings have been interpreted to support the view 
that dyslexic readers rely more on phonological letter-by-letter 
decoding than typical readers. On the other hand, Bergmann and 
Wimmer (2008) have shown that even dyslexic readers (German 
speaking, ages 15-18 years) rely on the direct access to lexical 
information when reading from print to phonology for familiar 
letter strings, even though they are slower than nonimpaired readers. 
The so-called lexicality effect (i.e., the faster reading of word 
stimuli compared with reading nonwords), has been demonstrated 
to increase with grade level from Grade 1 to Grade 5 (Zoccolotti, 
De Luca, Di Filippo, Judica, & Martelli, 2009). This finding has 
been interpreted to be a result of more efficient use of the lexical 
information as children get older (Zoccolotti et al., 2009). This 
gradual shift from mainly using sequential letter-to-sound decod- 
ing to the predominant use of fast whole-word recognition during 
the development of reading acquisition gets support from Vaessen 
and Blomert (2010). Their study shows increasing speed differ- 
ences over years (Grades 1-6) between word and pseudoword 
reading. In the present study, we examined whether in Finnish 
(similar to the case in Italian; Zoccolotti et al., 2009), children with 
dyslexia show a later developmental shift of emphasis from pho- 
nological decoding strategy to lexical processing than typically 
reading children. We assessed this shift by comparing speed in 
pseudoword text reading to word list and text reading in Grades 2, 
3, and 8. 

Skilled fluent reading is based on accurate and automatic word 
recognition in different contexts that facilitates the activation of 
semantic processes. Together with the appropriate use of prosody, 
reading fluency supports quick comprehension of reading material 
(Kuhn, Schwanenflugel, & Meisinger, 2010). Words in context are 
usually read faster and more accurately than the same words 
without context (Jenkins, Fuchs, van den Broek, Espin, & Deno, 
2003). According to Posner and Snyder (1975), there are two 
processes used for speeding up word identification in a textual 
context: automatic semantic activation of lexical memory and 
slow-acting attention-demanding conscious use of context and 
world knowledge. Jenkins et al. (2003) have shown that the mean 
reading rate of fourth graders with dyslexia was uniformly dis- 
crepant from skilled readers both in context and list. However, 
children with dyslexia seemed to benefit less than skilled readers 
from the context: their reading rate in text was 1.19 times the 
rate of list reading, whereas in skilled readers the figure was 1.67 
(Jenkins et al., 2003). 

According to the verbal efficiency theory (Perfetti, 1985), defi- 
ciencies in children’s word reading proficiency affect their fluency 
skills. A certain level of word reading proficiency seems to be 
needed before cognitive resources may be released for the lan- 
guage processing needed in fluent text reading (Kim et al., 2012). 
Skillful readers can identify the meaning of familiar words rapidly
just by sight without effort (Ehri, 2005). Other factors besides
activation of lexical representations may also speed up word
recognition in text reading. Stanovich (1980) found that context
allows readers to anticipate possible upcoming words, while eye
movement studies have shown that it is possible to get information
about the next word parafoveally before fixating on it (Hyönä, 2011).
Barker, Torgesen, and Wagner (1992) demonstrated that orthographic
skills have a much stronger influence on the reading speed of
text than on the speed of single word identification: 20%
vs. 5%, respectively. Deficiency in fluent access to word
representations (i.e., poor orthographic skills) would therefore
affect the reading speed of text in context more and thus reduce the
difference in reading speed between text and single words. A longitudinal
design, such as the one used in our study, can reveal whether and
at what age children with familial risk and dyslexia acquire
sufficient word decoding skills for the release of cognitive resources for
language processing in order to speed up reading text in context
compared with word list reading.


The Present Study 


In summary, our study addresses three questions. First, what is 
the stability of reading and spelling skills after the early reading 
acquisition phase? Second, what is the effect of familial risk on 
reading and spelling development? We compared the development 
of reading speed, reading accuracy, and spelling across Grades 2, 
3, and 8 in three groups of children: (a) those with both dyslexia 
and familial risk, (b) those without dyslexia but with familial risk, 
and (c) control children with no dyslexia and without familial risk. 
Third, are reading speed differences in varying reading tasks and
materials (word list, text, and pseudoword text) similar across the
three groups of participants and across Grades 2, 3, and 8?


Method 


Participants 


All children (N = 173) in this study were participants in the
Jyväskylä Longitudinal Study of Dyslexia (JLD; e.g., Lyytinen et
al., 2008). They were originally selected for one of two groups:
with familial risk for dyslexia or without familial risk for
dyslexia.¹ For this study, children were further allocated to three
groups according to their reading and spelling skills at the end of
Grade 2 and familial risk status: (a) children with dyslexia and
familial risk (Dys_FR, n = 35), (b) children with no dyslexia and
with familial risk (NoDys_FR, n = 66), and (c) a control group of
children with no dyslexia and without familial risk (C, n = 72).²
(See later descriptions of the familial risk and dyslexia.)
Characteristics of the groups are presented in Table 1. There were no
differences between the groups in the parents’ age or education or
in the children’s performance IQ, age, or gender distribution.
However, the verbal IQ in the Dys_FR group was lower than in the
NoDys_FR and C groups, F(2, 169) = 6.63, p < .01.

All the children spoke Finnish as their native language and 
had no mental, physical, or sensory impairments. An exclusion 
criterion was both verbal (VIQ) and performance IQ (PIQ) 
being below 80, which was assessed in Grade 2 using the 
Wechsler Intelligence Scale for Children (3rd ed.; WISC-III;
Wechsler, 1991). Four performance scale subtests (Picture
Completion, Block Design, Object Assembly, and Coding) and 
five verbal scale subtests (Similarities, Vocabulary, Compre- 
hension, Series of Numbers, and Arithmetic) were used to 
estimate the PIQ and VIQ, respectively. None of the partici- 
pants were excluded according to the exclusion criterion. All 
participants attended regular classroom education. 


Familial Risk: Screening of the Families 


The children were originally selected from among 9,368 new- 
borns born in the province of Central Finland between April 1993 
and July 1996. The selection was made using a three-stage proce- 
dure: (a) a short parental questionnaire including three questions 
concerning difficulties in learning to read and spell among parents 
and their close relatives (8,417 respondents); (b) a detailed parental 
questionnaire concerning the reading history, the persistence of 
reading and spelling difficulties, and the reading habits of parents 
and their close relatives (3,130 respondents); and (c) testing of the 
reading and spelling skills (410 parents). 

For the child to be originally included in the familial risk group 
(n = 108), either of the parents had to show deficient performance 
in oral text reading or spelling and in single word reading tasks 
tapping phonological and orthographic processing. In addition, a 
reported onset of literacy problems during early school years and 
a first-degree relative with corresponding difficulties were re- 
quired for inclusion in the familial risk group. In the group without 
familial risk, both parents (n = 92) had no reported family history 
for dyslexia and had a z score above −1.0 in all reading and
spelling tasks described previously. The IQ of all parents, assessed 
with the Raven B, C, and D matrices (Raven, Court, & Raven, 
1992), had to be equal to or above 80 (for full details of recruit- 
ment, see Leinonen et al., 2001). 


Identification of Children With Dyslexia in Grade 2 


The identification of dyslexia was based on performance in 
five tasks (descriptions of the tasks will follow): (a) oral word 
and pseudoword reading, (b) oral text reading, (c) oral pseudo- 
word text reading, (d) oral word list reading, and (e) spelling 
words and pseudowords. Four measures of reading speed were
calculated: (a) mean response time (reaction time + response
duration) of correctly read words and pseudowords presented
one by one, (b) the number of words read per minute in the oral text
reading task, (c) the number of pseudowords read per minute in
oral pseudoword text reading, and (d) the number of correctly
read words in 2 min in oral word list reading. Correspondingly, four
measures of reading and spelling accuracy were calculated: the
number of (a) correctly read words and pseudowords presented 
one at a time, (b) correctly read words in oral text reading, (c) 


¹ From the 200 children originally screened, 18 children refused to take
part in the Grade 8 assessments, of whom three were from the group of 
children with reading disability and familial risk (Dys_FR), four from the 
group of children with no reading disability and with familial risk 
(NoDys_FR), and 11 were from the control group (children with no reading 
disability and without familial risk). 

² Nine children without familial risk fulfilled the criteria for reading
disability at the end of Grade 2 and were excluded from this study as in 
other studies examining the continuity of the genetic risk. 
Table 1


Characteristics of Parents and Their Children in the Three Groups: Children With Dyslexia and Familial Risk, Children With No
Dyslexia but With Familial Risk, and Control Children With No Dyslexia and Without Familial Risk


                      Dys_FR (n = 35)     NoDys_FR (n = 66)   Controls (n = 72)   Paired group
Variable              M         SD        M         SD        M         SD        comparisons
Parents
  Mother
    Age               29.62     4.26      29.32     4.22      29.67     4.10      Dys_FR = NoDys_FR = C
    Education          4.09     1.42       4.40     1.44       4.60     1.34
  Father
    Age               [?]       [?]       31.64     5.04      [?]       [?]
    Education         [?]       [?]       [?]       [?]        3.75     1.48
Children
  WISC-III
    Verbal IQ         94.17     9.75     100.85    11.77     102.38    11.10      Dys_FR < NoDys_FR*
                                                                                  Dys_FR < C**
    Performance IQ    97.26    14.25     100.77    11.79     103.23    14.10
  Age (years)
    Grade 2            8.98     0.34       8.99     0.32       8.98     0.29      Dys_FR = NoDys_FR = C
    Grade 3            9.99     0.45       9.85     0.32       9.83     0.29
    Grade 8           14.48     0.44      14.30     0.54      14.35     0.28
  Gender              19 girls, 16 boys   32 girls, 34 boys   34 girls, 38 boys


Note. Groups: Dys_FR = Dyslexia with familial risk; NoDys_FR = No dyslexia with familial risk; and Controls = Control children with no dyslexia
and without familial risk. Parental education was classified using a 7-point scale: 1 = only comprehensive school (CS); 2 = CS and short-term vocational
courses; 3 = CS and vocational school degree; 4 = CS and vocational college degree; 5 = CS and lower university degree/polytechnic degree; 6 = upper
secondary general school and lower university degree/polytechnic degree; 7 = CS or upper secondary general school and higher university degree
(master’s or doctorate). WISC-III = Wechsler Intelligence Scale for Children (3rd ed.).
correctly read pseudowords in oral pseudoword text reading, 
and (d) correctly written words and pseudowords, presented one 
by one in a dictation task. 

For the identification of dyslexia, a two-step procedure was 
used. First, a cutoff criterion for deficient performance was defined 
for each of the eight measures using the 10th percentile of the 
control group’s performance. Second, a child was considered to 
have dyslexia if she or he scored below the criteria in at least three 
out of four measures of reading speed or in at least three out of four 
measures in reading and spelling accuracy. In addition, a child who 
scored below the criteria both in two speed and two accuracy 
measures was considered to have dyslexia. 
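
The two-step rule above can be sketched in code. This is a minimal illustration rather than the authors' scoring procedure: the function names are hypothetical, the percentile uses Python's default "exclusive" interpolation (the article does not say which method was applied), and each measure is assumed to have been coded already as deficient or not, so the direction of each scale is left to the caller.

```python
from statistics import quantiles


def percentile_cutoff(control_scores, pct=10):
    """Return the pct-th percentile of the control group's scores.

    quantiles(..., n=100) returns the 99 cut points P1..P99, so the
    10th percentile sits at index 9. Interpolation method is an assumption.
    """
    return quantiles(control_scores, n=100)[pct - 1]


def has_dyslexia(speed_deficits, accuracy_deficits):
    """Apply the decision rule to two lists of four booleans each.

    A child is classified as having dyslexia if at least three of the four
    speed measures are deficient, or at least three of the four accuracy
    measures are deficient, or at least two of each are deficient.
    """
    n_speed = sum(speed_deficits)
    n_accuracy = sum(accuracy_deficits)
    return n_speed >= 3 or n_accuracy >= 3 or (n_speed >= 2 and n_accuracy >= 2)


# Hypothetical child: deficient on two speed and two accuracy measures.
example = has_dyslexia([True, True, False, False], [True, True, False, False])
```

Passing booleans rather than raw scores keeps the rule itself separate from the question of whether a low or a high value counts as deficient on a given measure (for example, long response times versus few words per minute).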


Measures 


Trained testers assessed reading and spelling skills individually 
in a laboratory setting with four different tasks in Grade 2 (June), 
Grade 3 (April), and Grade 8 (November) as a part of the JLD 
assessment procedure: (a) oral text reading, (b) oral pseudoword 
text reading, (c) oral word list reading, and (d) spelling pseudo- 
words. In all reading tasks, children were instructed to read “as 
quickly and accurately” as they could. Two different measures 
were calculated from each task: reading speed (the number of 
letters read in 1 s) and reading accuracy (the percentage of cor- 
rectly read items). Arithmetical means, calculated from the three 
oral reading tasks described previously, were used as composite 
measures of reading speed and reading accuracy separately for 
Grades 2, 3, and 8. The Cronbach’s alpha reliability for the reading 
speed composite was .93, .89, and .88 and for the reading accuracy 
composite .82, .83, and .75, in Grades 2, 3, and 8, respectively. 
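
As a rough illustration of the two derived scores, the sketch below computes letters per second for one task and averages three task scores into a composite. The function names, the 300-s reading time, and the two other task speeds are hypothetical; only the 877-letter length of the Grade 2 text comes from the task description.

```python
from statistics import mean


def letters_per_second(letters_read, seconds):
    """Reading speed as the number of letters read per second."""
    return letters_read / seconds


def composite(task_scores):
    """Arithmetic mean of the scores from the three oral reading tasks."""
    return mean(task_scores)


# Grade 2 text has 877 letters; the reading time of 300 s is made up.
text_speed = letters_per_second(877, 300.0)

# Hypothetical speeds for the word list and pseudoword text tasks.
overall_speed = composite([text_speed, 2.4, 1.8])
```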


Oral text reading (Grades 2, 3, and 8). At each grade level, 
participants read aloud an age-appropriate text for oral text read- 
ing. In Grade 2, the text (title “Exciting Journeys”) consisted of 19 
sentences in five paragraphs with a total of 124 words/877 letters 
(mean word length = 7.07 letters, and mean sentence length = 
6.53 words). For Grade 3, the text (title “Useless Belongings”) 
consisted of 18 sentences in four paragraphs and a total of 189 
words/1,154 letters (mean word length = 6.11 letters, and mean 
sentence length = 10.50 words). Finally, the Grade 8 text (title 
“Fjelds of Lapland”) consisted of 16 sentences in three paragraphs 
and a total of 207 words/1,591 letters (mean word length = 7.68 
letters/word, and mean sentence length = 12.94 words). Reading 
performance was recorded on a tape recorder (Grades 2 and 3) or 
a laptop computer (Grade 8). The total time to read the text was 
measured with a stop watch. The tapes and sound files were 
subsequently used to check the scoring of the children’s accuracy 
and speed. To assess the reliability of accuracy scoring, two 
trained coders independently scored accuracy in a randomly se- 
lected 10% of the sample, and the interrater agreement was .98. 

Oral pseudoword text reading (Grades 2, 3, and 8). 
Participants read aloud a short text made up of 19 pseudowords/ 
137 letters (Grade 2) or 38 pseudowords/277 letters (Grades 3 and 
8). The words and structure of the sentences resembled real Finn- 
ish in form but had no meaning. The mean word length was 7.21 
letters/word in Grade 2 and 7.29 letters/word in Grades 3 and 8. 
Similarly to the oral text reading, the child’s reading performance 
was recorded, and correctness of reading and time spent on reading 
were checked. In 10% of the sample, each pseudoword was judged 
by two coders as correctly or incorrectly read, and the interrater 
agreement was .95. 
Oral word list reading (Grades 2, 3, and 8). In the Lukilasse 
standardized reading test (Häyrinen, Serenius-Sirve, & Korkman, 
1999), the participant has 2 min to read aloud as many words as 
possible from a 90-item (Grade 2) or 105-item (Grade 3) list, 
assembled vertically in columns. The same list that was used in 
Grade 3 was administered also in Grade 8, but the time limit was 
reduced to 1 min. The length of the words increased gradually, 
ranging from three to 18 letters/word in Grade 2 and from three to 
22 letters/word in Grades 3 and 8. The mean length of the words 
was 9.08 letters in Grade 2 and 9.57 letters in Grades 3 and 8. A 
trained tester marked the incorrectly read words as the child was 
reading aloud. The correctness of tester markings was checked by 
another listener in 10% of the sample using the recordings, and the 
interrater reliability was .99. 

Oral word and pseudoword reading (used only for the iden- 
tification procedure of dyslexia in Grade 2). Children read 
aloud three- and four-syllable words and pseudowords (10 of each 
type, altogether 40 items) presented one by one with the program 
Cognitive Workshop (Seymour, 1995) on a computer screen. 

Spelling pseudowords (Grades 2, 3, and 8). We measured 
spelling accuracy with a list of pseudowords consisting of 12 
four-syllable items in Grades 2 and 3 and 20 three- to five-syllable 
items in Grade 8. Participants listened through headphones as a 
computer presented the items twice with a 2-s interval. Each 
pseudoword was scored as correct if all the phonemes were cor- 
rectly written without missing or extra letters. The percentage of 
correctly written pseudowords was used as the spelling accuracy 
measure separately for each grade. Cronbach’s alpha reliability 
coefficients were .80, .71, and .70 for Grades 2, 3, and 8, respec- 
tively. 

Spelling words and pseudowords (used only for the identi- 
fication procedure of dyslexia in Grade 2). Participants used a 
pencil to write 6 four-syllable words and 12 four-syllable pseudo- 
words presented similarly as described previously. Each stimulus 
(word or pseudoword) was scored as correct if the participants 
wrote all the phonemes correctly without missing or extra letters. 
The percentage of correctly written words/pseudowords was used 
as the spelling accuracy measure. Cronbach’s alpha reliability 
coefficient was .87. 


Results 


Distributions and Stability of Literacy Skills 


All distributions of reading speed measures were normal or 
close to normal. The distributions of reading and spelling 
accuracy, instead, showed a ceiling effect in all tasks in all 
grades. The ceiling effect was particularly clear in oral word list 
reading accuracy, with 82.5%, 89.0%, and 98.3% of the partic- 
ipants exceeding 90% accuracy in Grades 2, 3, and 8, respec- 
tively. The ceiling effect also appeared in oral text reading 
accuracy, where the portion of children above the 90% accuracy 
level was 79.2%, 86.2%, and 89.5% in Grades 2, 3, and 8, 
respectively. We applied logarithmic transformation to correct 
the distribution in the oral text reading task, whereas the dis- 
tributions of the tasks for oral word list reading and spelling 
pseudowords could not be normalized. Because of the nonnor- 
mal distributions, we conducted both parametric and nonpara- 
metric analyses when applicable. As all conclusions derived 
from the parametric and nonparametric analysis results were 
identical, we report only the parametric results. In reading and 
spelling accuracy measures, one to four extreme outliers were 
moved to the tail of the distribution before analyses to avoid 
overemphasizing their effects on results. No participants were 
dropped from the sample. 

Table 2 presents correlations between overall (averaged com- 
posite measure of) reading speed and accuracy as well as spelling 
accuracy. For the reading speed measures, the correlations
between performance across different grades were high (.72–.88).
For reading accuracy measures, the correlations varied from
moderate to high (.51–.69), and for the spelling measure they were
moderate (.41–.59).


Table 2


Spearman Correlations of Reading Speed, Reading Accuracy, and Spelling Accuracy in Grades 2, 3, and 8


                      Reading speed              Reading accuracy           Spelling accuracy
Variable              1        2        3        4        5        6        7        8
Reading speed
  1. Grade 2          —
  2. Grade 3          .88***   —
  3. Grade 8          [?]      [?]      —
Reading accuracy
  4. Grade 2          [?]      [?]      [?]      —
  5. Grade 3          [?]      [?]      [?]      .69***   —
  6. Grade 8          [?]      .47***   [?]      [?]      [?]      —
Spelling accuracy
  7. Grade 2          [?]      .46***   [?]      [?]      .49***   .40***   —
  8. Grade 3          .40***   .40***   [?]      [?]      [?]      [?]      [?]      —
  9. Grade 8          .36***   [?]      [?]      [?]      [?]      [?]      [?]      [?]


Note. N = 173 in all correlations between the Grade 3 and Grade 8 measures, and N = 171 in correlations where a Grade 2 measure is included. Reading
speed and accuracy are the arithmetic means of three oral reading tasks at each grade: word list, text, and pseudoword text. Spelling accuracy is the
percentage of correctly spelled items in the pseudoword spelling task.
*** p < .001.
Continuity of the Familial Risk: Group Differences in 
the Development of Literacy Skills 


We examined the development of reading speed and accuracy as 
well as spelling in the groups with mixed-design analyses of 
variance (ANOVAs) including grade (2, 3, and 8) as the within- 
subject factor and group (Dys_FR, NoDys_FR, and controls) as 
the between-subjects factor. For both reading speed and accuracy, 
a composite score was used as the measure at each grade level 
(arithmetic mean from the three tasks: list, text, and pseudoword 
text reading). Figure 1 presents the development of each skill in the 
three groups. To evaluate the gain children made between two 
grades (Grade 2 and Grade 3; Grade 3 and Grade 8), a difference 
score was calculated by subtracting the corresponding means from 
each other. We used one-way ANOVAs to study group differences 
in these gains as well as in separate tasks of reading speed and 
accuracy and spelling in each grade. In the post hoc pairwise 
comparisons, we used either Bonferroni (when equal variances) or 
Dunnett’s T3 (when unequal variances) correction when evaluat- 
ing the significances of group differences (see Table 3). 

In the mixed-design ANOVA for the reading speed composite,
both main effects, grade and group, were significant, F(1.62,
271.69) = 724.69, p < .001, ηp² = .81, and F(2, 168) = 49.79, p <
.001, ηp² = .37, respectively, as was the Grade × Group interaction,
F(3.23, 271.69) = 2.93, p < .05, ηp² = .03. To further
evaluate the dissimilarity between the groups in the development
of reading speed between Grade 2 and Grade 3 as well as between
Grade 3 and Grade 8, the tests of within-subject contrasts for the
Grade × Group interaction were used. The effect was significant
for the development between Grade 2 and Grade 3, F(2, 168) =
7.3[?], p < .001, ηp² = .08, but not for the development between
Grade 3 and Grade 8, which suggests that reading speed
development differed between the groups from Grade 2 to Grade 3 but
not from Grade 3 to Grade 8. The ANOVA post hoc pairwise
comparisons (with Bonferroni corrections for significance) of the
reading speed improvement between Grade 2 and Grade 3 showed
that children in the Dys_FR group improved their overall reading
speed more than the children in the control group (p < .001) and
the NoDys_FR group (p < .01).
However, the children in the Dys_FR group still did not reach the
level of the other two groups, as shown by the post hoc ANOVA
comparisons at Grade 3 (see Table 3). The overall reading speed of
children in the Dys_FR group was about 50%, 65%, and 75% in
Grades 2, 3, and 8, respectively, of the reading speed of children
in the two other groups (NoDys_FR and controls). In Grade 8, the
overall reading speed of Dys_FR children was approximately at
the Grade 3 level of the two other groups,
indicating a lag of 5 years in development. Effect sizes were
estimated (Cohen’s d computed using the pooled standard
deviation), and they were large not only for Grades 2 and 3 but also
for Grade 8: Dys_FR versus NoDys_FR (d = 1.21) and Dys_FR
versus control group (d = 1.73). No significant differences
between the NoDys_FR and control group were found in
ANOVA post hoc pairwise comparisons (with Bonferroni
corrections for significance) of the gain children made in overall
reading speed between Grade 2 and Grade 3 or between Grade
3 and Grade 8.
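
The partial eta squared values in this paragraph can be cross-checked against the reported F ratios and degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is only a consistency check on the printed statistics; the function name is hypothetical, and the inputs are the values reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Reading speed composite, values as reported in the text.
grade_effect = partial_eta_squared(724.69, 1.62, 271.69)
group_effect = partial_eta_squared(49.79, 2, 168)
interaction = partial_eta_squared(2.93, 3.23, 271.69)

for label, value in [("grade", grade_effect),
                     ("group", group_effect),
                     ("grade x group", interaction)]:
    print(f"{label}: {value:.2f}")
```

Rounded to two decimals, the three results agree with effect sizes of .81, .37, and .03.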

For each task in each grade level, we separately conducted 
one-way ANOVAs. These showed that children in the Dys_FR 
group read slower in all tasks throughout Grades 2, 3, and 8 
than the two other groups (see Table 3). The two groups without 
dyslexia (NoDys_FR and controls) did not differ from each 
other in any of the reading speed measures, although the effect 
sizes varied from small to moderate (.15-.42). 

In the analysis of the reading accuracy composite, both main
effects, grade and group, were significant, F(1.75, 293.92) =
104.28, p < .001, ηp² = .38, and F(2, 168) = 83.12, p < .001, ηp² =
Figure 1. Reading speed and accuracy (composite means) and pseudoword spelling accuracy in the three
groups: children with dyslexia and familial risk, children with no dyslexia but with familial risk, and control
children at Grades 2, 3, and 8.
Table 3


Descriptive Statistics and Group Comparisons of Dys_FR, NoDys_FR, and Control Groups With One-Way Analyses of Variance


                        Dys_FR (N = 35)   NoDys_FR (N = 66)  C (N = 72)                    Effect size
                                                                                           Dys_FR vs.  Dys_FR   NoDys_FR
Measure                 M        SD       M        SD        M        SD       Fᵃ          NoDys_FR    vs. C    vs. C
Reading speed
  Overall
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      [?]         2.07        2.12     .40
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      36.05***    1.33        1.76     .30
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      30.79***    1.21        1.73     [?]
  Word list
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      [?]         1.80        1.88     .34
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      38.41***    1.57        1.87     .27
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      [?]         0.97        1.23     [?]
  Text
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      54.01***    1.87        [?]      .39
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      41.99***    1.61        1.89     [?]
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      [?]         1.30        1.95     [?]
  Pseudoword text
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      39.52***    1.79        1.70     .39
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      6.34***     0.57        0.72     .19
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      26.08***    [?]         1.50     .35
Reading accuracy
  Overall
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      [?]         1.54        2.04     .30
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      [?]         [?]         [?]      .40
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      [?]         1.11        1.72     [?]
  Word list
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      32.91***    1.46        1.92     .41
    Grade 3             [?]      [?]      [?]      [?]       [?]      [?]      28.67***    1.05        1.50     .40
    Grade 8             [?]      [?]      [?]      [?]       [?]      [?]      13.84***    0.96        0.83     .08
  Text
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      33.24***    [?]         1.40     .03
    Grade 3             90.00    [?]      [?]      [?]       [?]      1.89     44.95***    1.18        [?]      .66
    Grade 8             90.35    [?]      95.51    4.16      96.78    [?]      31.84***    1.10        1.78     .38
  Pseudoword text
    Grade 2             [?]      [?]      [?]      [?]       [?]      [?]      38.10***    1.32        1.83     [?]
    Grade 3             [?]      [?]      [?]      [?]       [?]      8.38     [?]         1.71        [?]      .24
    Grade 8             82.14    12.94    92.03    9.63      94.03    [?]      20.84***    0.92        1.44     [?]
Spelling accuracy
    Grade 2             [?]      [?]      74.36    17.32     74.53    19.10    39.96***    1.62        1.59     .01
    Grade 3             [?]      [?]      [?]      [?]       [?]      14.19    24.44***    1.01        1.45     .36
    Grade 8             82.39    15.36    94.32    7.23      95.49    4.45     29.90***    1.13        1.45     .20


Note. Groups: Dys_FR = Dyslexia with familial risk; NoDys_FR = No dyslexia with familial risk; and C = Control children with no dyslexia and
without familial risk. Groups with different subscript letter (x, y, or z) were significantly different in the post hoc pair-wise comparisons of analyses of
variance F tests (p < .05). Bonferroni or Dunnett’s T3 corrections were used, depending on equality or inequality of the variances. Effect sizes were
estimated with Cohen’s d (computed with pooled standard deviations). Overall reading speed = arithmetic mean of the number of read letters/second in
word list, text, and pseudoword text reading. Overall reading accuracy = arithmetic mean of the percentage of correctly read words/pseudowords in word
list, text, and pseudoword text reading. Spelling accuracy = percentage of correctly written pseudowords.
ᵃ Degrees of freedom varied between (2, 165) and (2, 170) due to missing data in single measures.
*** p < .001.


.50, respectively, as was the Grade × Group interaction,
F(3.50, 293.92) = 11.72, p < .001, ηp² = .12. The test of
within-subject contrasts for the Grade × Group interaction was not
significant between Grade 2 and Grade 3, but it was significant
between Grade 3 and Grade 8, F(2, 168) = 18.78, p < .001, ηp² =
.18, a result that suggests that there was a difference in the
developmental pace of reading accuracy between the groups in 
Grade 3 and Grade 8. The ANOVA post hoc pairwise comparisons 
(with Dunnett’s T3 corrections for significance) showed that be- 
tween Grade 3 and Grade 8, the children in the Dys_FR group 
developed faster in reading accuracy than did the children in the 
other two groups (both p < .001). However, as with reading speed
described earlier, the children in the Dys_FR group did not quite
reach the level of the other two groups (see Figure 1 and Table 3).
In the Dys_FR group, the overall reading accuracy level reached
90% in Grade 8, whereas the two groups without dyslexia
(NoDys_FR and controls) had reached the 90% level in overall
reading accuracy already at the end of Grade 2. Effect sizes were
large not only in Grades 2 and 3 but also for the Grade 8 group 
comparisons in reading accuracy: Dys_FR versus NoDys_FR (d = 
1.11) and Dys_FR versus Control group (d = 1.72). No significant 
differences between the NoDys_FR and control group were found 
in ANOVA post hoc pairwise comparisons (with Bonferroni cor- 
rections for significance) of the gain children made in overall 
reading accuracy between Grade 2 and Grade 3 or between Grade 
3 and Grade 8. 

One-way ANOVAs, done separately for each task in each grade, 
showed that children in the Dys_FR group made more errors in all 
reading tasks throughout Grades 2, 3, and 8 than the two other 
groups (see Table 3). The two groups without dyslexia did not 
differ from each other in any of the reading accuracy measures, 
except in text reading accuracy in Grade 3. Effect sizes were small 
or medium (.03-.66) between these two groups in reading accu- 
racy measures throughout Grades 2, 3, and 8. 

In the analysis of pseudoword spelling, both main effects, grade
and group, were significant, F(1.87, 314.79) = 181.98, p < .001,
ηp² = .52, and F(2, 168) = 49.57, p < .001, ηp² = .37, respectively.
The Grade × Group interaction was also significant, F(3.75,
314.79) = 11.86, p < .001, ηp² = .12. The test of within-subject
contrasts for the Grade × Group interaction was significant
between Grade 2 and Grade 3, as well as between Grade 3 and Grade
8, F(2, 168) = 8.64, p < .001, ηp² = .09, and F(2, 168) = 5.84, p <
.01, ηp² = .06, respectively. The ANOVA post hoc pairwise
comparisons (with Dunnett’s T3 corrections for significance) showed
that between Grade 2 and Grade 3, children in the Dys_FR group 
improved their spelling accuracy more than children in the control 
group (p < .05) and the NoDys_FR group (p < .01), and more 
than the control group (p < .01) between Grade 3 and Grade 8. 
Note, however, that the starting point of spelling accuracy in the 
two groups without dyslexia was approximately twice as high as in 
the Dys_FR group (with accuracy percentages of 73% vs. 39%). 
Although children in the Dys_FR group made better progress in 
spelling accuracy than the NoDys_FR and control groups, they 
reached an accuracy level of 82.39% in Grade 8, which is comparable
to the level of third graders in the other two groups (see Table 3). 
The group differences in Grade 8 were confirmed by effect sizes, 
which were large: 1.13 (Dys_FR vs. NoDys_FR) and 1.45 
(Dys_FR vs. control group). The two groups without dyslexia,
NoDys_FR and controls, reached close to a 95% accuracy level in
pseudoword spelling in Grade 8 and did not differ from each other
in any of the spelling measures. Effect sizes were small or
moderate (.01–.36).
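
To illustrate how an effect size of this kind is obtained, the sketch below computes Cohen's d with a pooled standard deviation for the Grade 8 pseudoword spelling comparison between the NoDys_FR and Dys_FR groups. It rests on several assumptions: the function name is hypothetical, the means and standard deviations are as read from Table 3, the group sizes are the full-sample values from Table 1, and the n-weighted pooling formula is one common choice that the article does not spell out. Because some measures had missing data, the result need not match the reported value of 1.13 exactly.

```python
from math import sqrt


def cohens_d_pooled(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the n-weighted pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


# Grade 8 pseudoword spelling accuracy (percent correct).
# NoDys_FR: M = 94.32, SD = 7.23, n = 66; Dys_FR: M = 82.39, SD = 15.36, n = 35.
d_spelling = cohens_d_pooled(94.32, 7.23, 66, 82.39, 15.36, 35)
print(f"d = {d_spelling:.2f}")
```

Under these assumptions the sketch gives a d of about 1.11, close to the reported 1.13.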


Differences in Reading Speed According to Task in 
Different Groups 


To see whether the differences in reading speed between the 
three tasks were similar in the three groups, we performed three 
separate mixed-design ANOVAs. Task (text vs. pseudoword text, 
text vs. word list, or pseudoword text vs. word list) was used as the 
within-subject factor, and group (Dys_FR, NoDys_FR, and control 
group) as the between-subjects factor. We did all ANOVAs sep- 
arately for each grade level (2, 3, and 8) to see whether the 
differences between tasks were similar in each grade. Because nine 
mixed-design ANOVAs were conducted, stricter than usual sig- 
nificance cutoffs were used to avoid family-wise errors. This was 
done by dividing the commonly used significance levels by the 
number of ANOVAs done. As a follow-up analysis, we compared 
performance in the different tasks within each group using
paired-samples t tests. Figure 2 presents group differences in the three
reading speed tasks in Grades 2, 3, and 8. Table 4 presents F values 
and estimates of the effect sizes of the mixed-design ANOVAs. 
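
The adjustment described above amounts to dividing each conventional significance level by the number of tests. The sketch below assumes that the "commonly used significance levels" are .05, .01, and .001; the function and variable names are hypothetical.

```python
def adjusted_alpha(alpha, n_tests):
    """Divide a significance level by the number of tests conducted."""
    return alpha / n_tests


N_ANOVAS = 9  # three task comparisons at each of three grade levels

# Conventional levels (an assumption) mapped to their stricter cutoffs.
cutoffs = {alpha: adjusted_alpha(alpha, N_ANOVAS) for alpha in (0.05, 0.01, 0.001)}

for alpha, cutoff in cutoffs.items():
    print(f"alpha = {alpha}: adjusted cutoff = {cutoff:.5f}")
```

With nine tests, the .05 level becomes roughly .0056, the .01 level roughly .0011, and the .001 level roughly .00011.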

Word list versus pseudoword text. We compared reading 
speed in word list and pseudoword text reading first to see the 
effect of lexicality. Both main effects, task and group, were
significant. The interaction Task × Group was significant in Grades
2 and 3. In paired-samples t tests, the difference between tasks was
not significant in the Dys_FR group, a result that suggests that 
children in this group read word lists and pseudoword texts at 
equal speeds. In the NoDys_FR group and in the control group, 
Figure 2. Reading speed means in pseudoword text, word list, and text reading tasks in the three groups:
children with dyslexia and familial risk, children with no dyslexia but with familial risk, and control children at
Grades 2, 3, and 8.
Table 4


F Values and Estimates of Effect Sizes From Mixed-Design Analyses of Variance With Reading Speed as the Dependent Measure,
Task as the Within-Subjects Factor, and Group as the Between-Subjects Factor


Compared tasks                        Main effect of task         Effect sizeᵃ
Word list versus pseudoword text
  Grade 2                             F(1, 168) = 176.47***       .51
  Grade 3                             F(1, 170) = 215.07***       .56
  Grade 8                             F(1, 166) = 244.49***       .60
Word list versus text
  Grade 2                             F(1, 168) = 597.92***       .78
  Grade 3                             F(1, 170) = 292.26***       .63
  Grade 8                             F(1, 169) = 152.17***       .47
Text versus pseudoword text
  Grade 2                             F(1, 168) = 1591.75***      .78
  Grade 3                             F(1, 170) = 1549.10***      [?]
  Grade 8                             F(1, 167) = 2249.39***      .93


Note. [...] dyslexia and without familial risk, n = 72.
ᵃ Effect size = partial eta squared.
Because multiple mixed-design analyses of variance were conducted, stricter-than-usual cutoffs for significance were used: *** p < .005.


children read the word lists about 1–2 letters/second faster than
pseudoword texts (t(64) = 6.47, p < .001, and t(65) = 12.14, p <
.001, in Grade 2 and Grade 3, respectively, for the NoDys_FR
group, and t(70) = 8.87, p < .001, and t(71) = 15.13, p < .001,
for the control group in Grade 2 and Grade 3, respectively). In
Grade 8, all groups read word lists about 2.5 letters/second faster
than pseudoword texts, t(32) = 11.36, t(64) = 11.40, and t(70) =
9.60, all p < .001, for the Dys_FR, NoDys_FR, and control group,
respectively (see Table 3).

Word list versus text. We compared reading speed in word 
lists and text reading to see the effect of context on reading speed. 
Both main effects, task and group, were significant in all grades (2, 
3, and 8). The interaction Task X Group was significant only in 
Grade 2. In paired sample f tests, the difference between tasks was 
significant in all groups, 1(34) = 9.01, (64) = 17.03, and (70) = 
19.61, all p < .001, for the Dys_FR, NoDys_FR, and control 
group, respectively, indicating that all groups read words in con- 
text faster than isolated words. However, the ANOVA post hoc 
pairwise group comparisons (with Bonferroni corrections for the 
significance) indicated that the difference in reading speed be- 
tween word lists and texts was smaller in the Dys_FR group than 
in the NoDys_FR group and in the control group (both p < .001). 
In Grade 3 and in Grade 8, all groups read text faster than they read 
word lists, 34) = 6.92 and 134) = 7.48; 164) = 10.51 and 
1(64) = 8.26; and #(71) = 14.62 and (71) = 8.30, all p < .001, for 
the Dys_FR, NoDys_FR, and control group in Grade 3 and in 
Grade 8, respectively. 

Text versus pseudoword text. Finally, we compared reading 
speed between text and pseudoword text reading tasks to see the 
effect of the lexicality and meaning of the text. Both main effects, 

task and group, as well as the interaction Task X Group in Grades 
2, 3, and 8 were significant. In the ANOVA post hoc pairwise 
group comparisons (with Bonferroni corrections for the signifi- 
cance), the difference in reading speed between texts and pseudo- 
word texts was smaller in the Dys_FR group than in the 
NoDys_FR and control groups (all p < .001 in Grades 2 and 3, and 
both p < .01 in Grade 8; 2 vs. 4 letters/second, respectively, in 
Grades 2 and 3, and 4 vs. 5 letters/second in Grade 8; see Table 3). 


Interaction effect of 


Main effect of group, Effect size* Task * Group Effect size* 


F (2, 168) = 47.54""* 36... .F (2, 168) = 12.55 oe 
FQ, 170) = 25.74," 23 F(, 110).= 30.16" a 
F (2, 166) = 26.77" 24° F (2, 166) = 11.12 01 
F (2, 168) = 54.67" 39 -F(2, 168) = 14.55** 5 
F (2, 170) = 44.36"** 34 F(2, 170) = 13.14 04 
F (2, 169) = 29.98"** 26 -F(2, 169) = 11.06 01 
F (2, 168) = 58.93""* Al F (2, 168) = 27.09°* a 


Groups: Dys_FR = Dyslexia with familial risk, n = 35; NoDys_FR = No dyslexia with familial risk, n = 66; and C = Control children with no 


#0 <0. 
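
The effect sizes in Table 4 are partial eta squared values. For a single effect, partial eta squared can be recovered from the reported F ratio and its degrees of freedom, which gives a quick consistency check on individual table entries. The sketch below is an illustration only, not the authors' analysis code, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    if df_effect <= 0 or df_error <= 0:
        raise ValueError("degrees of freedom must be positive")
    # Identity: eta_p^2 = SS_effect / (SS_effect + SS_error)
    #                   = (F * df_effect) / (F * df_effect + df_error)
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of task, word list versus pseudoword text, Grade 3 (Table 4).
print(round(partial_eta_squared(215.07, 1, 170), 2))  # 0.56
# Main effect of group for the same comparison and grade.
print(round(partial_eta_squared(25.74, 2, 170), 2))   # 0.23
```

Both values agree with the effect sizes printed for those rows.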


Discussion 


In this study, we examined three aspects of literacy develop- 
ment: the stability of literacy skills after the initial reading acqui- 
sition phase across Grades 2, 3, and 8; the effect of familial risk on 
literacy skill development during this period; and the effects of 
different types of reading material (word list, text, and pseudoword 
text) on reading speed. We compared the development of three 
groups: children with familial risk and dyslexia (the Dys_FR
group), children with familial risk but without dyslexia (the
NoDys_FR group), and a control group of children with no dyslexia
and without familial risk.

We found high stability for reading speed development, whereas 
in reading and spelling accuracy, the development was moderately 
stable from the second to the eighth grade. Children with familial 
risk and dyslexia (the Dys_FR group) did not catch up to the other 
two groups in reading speed, reading accuracy, or spelling, al- 
though they progressed more than the other two groups in reading 
speed between Grade 2 and Grade 3, in reading accuracy between 
Grade 3 and Grade 8, and in spelling accuracy throughout the 
follow-up. The Dys_FR group’s literacy skills in Grade 8 were 
overall comparable to the level of the third graders in the two other 
groups. The children with familial risk but no dyslexia 
(NoDys_FR) did not differ significantly from the control group 
children in any of the assessed reading and spelling measures, 
except in text reading accuracy in Grade 3, although the effect 
sizes were often of moderate size between the two groups. The 
reading speed in children with familial risk and dyslexia varied 
less according to the type of reading material than in the two other 
groups in Grades 2 and 3, but this effect diminished in Grade 8. 

In reading speed, the correlations across groups between the 
grades were high (.72-.88). This indicates high stability of devel- 
opment and is in line with earlier findings in consistent orthogra- 
phies (Landerl & Wimmer, 2008; Parrila et al., 2005; Torppa et al., 
2007). The size of the correlation between the assessments in 
Grades 2 and 8 was .72, which showed that even after 6 years of 
school attendance, the relative positions of individuals remained 
very similar. The nearly parallel developmental paths of the three 
groups confirm the idea of stability in reading speed. The only
exception, the faster progress made by children in the Dys_FR 
group in reading speed between Grades 2 and 3, could be inter- 
preted to be a delayed developmental spurt that was made by 
normally developing children before the end of Grade 2. In pre- 
vious Finnish studies of reading accuracy (Aunola, Leskinen, 
Onatsu-Arvilommi, & Nurmi, 2002; U. Leppänen, Niemi, Aunola,
& Nurmi, 2004) as well as in a Finnish study of oral reading 
fluency (Parrila et al., 2005), initial reading level has been found 
to be negatively associated with the development of reading skill 
during the first two grades at school. Our finding that children in 
the NoDys_FR and control groups made only a little progress in 
reading speed between Grades 2 and 3 suggests that this kind of 
negative association between the initial level and further growth in 
reading speed continues to be true until the end of Grade 3. 
Between Grades 3 and 8, the development in the three groups was 
highly parallel, and we found no evidence suggesting either catch- 
ing up or falling behind in any group. This supports the idea that 
differences between the groups are long-lasting, as has been found 
to be the case in Dutch (Boets et al., 2010; de Jong & van der Leij, 
2003; van Bergen et al., 2011) and in English (Francis et al., 1996; 
Snowling et al., 2007) readers with and without dyslexia. 
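
Stability in this sense refers to how well children keep their relative positions from one assessment to the next, which is what a between-grade correlation measures. The sketch below computes a Pearson correlation for two assessments. The scores are made-up illustration values, not data from the study, and the function name is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical reading speeds (letters/second) for six children.
grade2 = [2.1, 3.0, 3.8, 4.5, 5.2, 6.0]
grade8 = [7.9, 8.4, 9.6, 9.1, 10.8, 11.5]
r = pearson_r(grade2, grade8)
print(f"stability (Grade 2 to Grade 8): r = {r:.2f}")
```

A value near 1 means that the rank order of children is almost unchanged between the two assessments, even if every child has become faster.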

The consistent lag of the Dys_FR group in reading speed, 
present already at the beginning of the follow-up, could be ex- 
pected because in transparent orthographies the main characteristic 
of dyslexia has been shown to be slow reading (e.g., de Jong & van 
der Leij, 2003; Landerl & Wimmer, 2008; Landerl, Wimmer, & 
Frith, 1997; Wimmer, 1996; Zoccolotti et al., 1999). The magni- 
tude of the lag in Grade 8, approximately 5 years, was, however, 
larger than expected. De Jong and van der Leij (2003) have
previously reported that Dutch children diagnosed with dyslexia in 
Grade 3 on the basis of reading fluency showed a delay of 3.5 
years by the end of Grade 6 compared with normal readers in 
reading speed. In addition, in earlier studies that have used a 
reading-level matched group as controls, the age difference has 
usually been 3–4 years on average (Constantinidou & Stainthorp,
2009; Ziegler et al., 2003). 

The stability in reading accuracy development was moderate to 
relatively high according to correlations (.51–.69) between Grades
2, 3, and 8 but lower than in reading speed. Correlations were
somewhat lower than reported in previous studies of Finnish
orthography (U. Leppänen et al., 2006; Parrila et al., 2005). The
size of the correlations could be inflated by ceiling effects, but only 
for word list and text reading. In these tasks, where the items were 
real words, the percentage of correctly read words exceeded 90% 
before our first assessment point in this study (i.e., the end of 
Grade 2) in the NoDys_FR and control groups and in Grade 3 in 
the Dys_FR group. The accuracy percentages in the NoDys_FR 
and control groups are comparable to those reported earlier in 
transparent orthographies (Aro & Wimmer, 2003; de Jong & van 
der Leij, 2003). The ceiling effect also explains the finding that 
children in the Dys_FR group made better progress in reading 
accuracy between Grades 3 and 8 than children in the two other 
groups. After Grade 3, the children in the NoDys_FR and control 
groups simply had less room for development, having accuracy 
percentages at or above 96% in word list and text reading. In Grade 
8, the mean percentage of correctly read words in the word list 
reading task was above 97% in all groups. Our finding that most 
children in the Dys_FR group also acquired accurate reading of 


words is in line with the notion of de Jong and van der Leij (2003) 
that dyslexic children learning to read in a regular orthography 
eventually acquire sufficiently good skills in phonemic awareness 
to enable accurate decoding ability. However, reading accuracy
concerning pseudoword items remained rather low in the Dys_FR 
group, even in Grade 8 (approximately 82%) and was equivalent to 
the accuracy of second graders in the NoDys_FR and control 
groups. This indicates persistent problems in phonological decoding
among reading-disabled children when the demands of the task
increase, in line with the findings of de Jong and van der Leij
(2003). Previously, the parents of children with familial risk for
dyslexia have been found to show difficulties in phonological
decoding in the Jyväskylä Longitudinal Study of Dyslexia
(P. H. T. Leinonen et al., 2001).

Problems in phonological decoding were seen especially clearly 
in pseudoword spelling, in which children in the Dys_FR group 
started the follow-up in Grade 2 with a very low accuracy per- 
centage, 39%. Although they progressed faster than the other 
groups throughout the whole follow-up period, they remained 
behind children in the two other groups and ended up with a 
similar accuracy level as in pseudoword text reading (i.e., 82%) in 
Grade 8. This percentage is, however, much higher than the level 
of German-speaking Austrian children with reading and spelling 
difficulties: the mean of correctly spelled words for them was 
around 40%–45% in Grade 4 (Wimmer & Mayringer, 2002). This
difference is probably because in German a simple phoneme-grapheme
translation is not sufficient for accurate spelling (Wimmer &
Mayringer, 2002). In contrast to German, Finnish orthography has
symmetrically transparent correspondences between phonemes and
graphemes, that is, for both reading and spelling. The stability
in spelling was moderate (.41–.59) but somewhat lower than found
earlier in Finnish (U. Leppänen et al., 2006). This discrepancy is
probably due to this study’s use of pseudoword items, whereas in
U. Leppänen et al. (2006) a word-spelling task was used. In
pseudoword spelling tasks, correlations have been found to be lower
than in tasks including words (Lervåg & Hulme, 2010).

The greater gains in literacy skills by children in the Dys_FR 
group between Grades 2 and 3 could also be due to the extra 
support and intervention they have received. At Finnish schools, 
more than 20% of all school children in Grades 1-9 receive 
part-time special education at some point of the school year. This 
type of extra support is most frequent at the lowest grade levels, 
and the most common indication for part-time special education is 
problems in reading development. Altogether, 85.7% of children in 
the Dys_FR group received various kinds and amounts of extra 
support at school during Grades 1-3. This proportion is much
larger than the corresponding proportions in the NoDys_FR and
control groups (34.8% and 11.1%, respectively). In addition,
48.6% of children in the Dys_FR group (4.5% in the NoDys_FR 
and none in the control group) took part in an intensive interven- 
tion study (55 hr within 14 weeks) organized by the JLD project, 
including speech and auditory training as well as practicing of 
reading and writing. However, despite the support they
received, their literacy skills lagged substantially behind the skills of
their peers.
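
The percentages in this paragraph can be converted into approximate numbers of children. The sketch below assumes that the group sizes given in the note to Table 4 (35, 66, and 72) also apply to the support data; the resulting counts are implied values, not figures reported in the article, and the function name is hypothetical.

```python
# Group sizes as given in the note to Table 4 (assumed to apply here too).
group_sizes = {"Dys_FR": 35, "NoDys_FR": 66, "Control": 72}

# Percentages reported in the text for extra support at school in Grades 1-3.
support_pct = {"Dys_FR": 85.7, "NoDys_FR": 34.8, "Control": 11.1}
# Percentages reported for participation in the intensive intervention study.
intervention_pct = {"Dys_FR": 48.6, "NoDys_FR": 4.5, "Control": 0.0}


def implied_count(percent, n):
    """Number of children implied by a percentage of a group of size n."""
    return round(percent / 100 * n)


for group, n in group_sizes.items():
    supported = implied_count(support_pct[group], n)
    intervened = implied_count(intervention_pct[group], n)
    print(f"{group}: {supported}/{n} extra support, {intervened}/{n} intervention")
```

Under that assumption, the percentages correspond to 30, 23, and 8 children receiving extra support and to 17, 3, and 0 children taking part in the intervention.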

Our findings give weak support to the continuity of familial risk. 
The means of the NoDys_FR group fell between those of the 
Dys_FR group and of the control group, but the NoDys_FR and
control groups differed significantly in only one of the reading and
spelling measures: text reading accuracy in Grade 3. Note also that
the NoDys_FR group performed consistently better than the
Dys_FR group. However, although the difference between the
NoDys_FR group and the control group was not significant overall,
the moderate effect sizes suggest that with a larger sample
size, we might have found a significant difference. On the other
hand, significant differences between groups with or without fa- 
milial risk and no dyslexia have been found with much smaller 
sample sizes in English (Snowling et al., 2003, 2007) and in Dutch 
(Boets et al., 2010; van Bergen et al., 2011). 
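
The remark about sample size can be made concrete with a rough power calculation. The sketch below uses a normal approximation for a two-sided comparison of two independent groups. The standardized effect sizes passed in are hypothetical inputs, not estimates from this study; the group sizes (66 and 72) are those reported for the NoDys_FR and control groups. The function name is hypothetical as well.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n1, n2, alpha=0.05):
    """Approximate power to detect a standardized mean difference d.

    Normal approximation to a two-sided, two-sample test; the negligible
    probability of rejecting in the wrong direction is ignored.
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 observations")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    noncentrality = abs(d) * sqrt(n1 * n2 / (n1 + n2))
    return 1 - z.cdf(z_crit - noncentrality)


# Group sizes from the study; effect sizes are hypothetical inputs.
for d in (0.2, 0.3, 0.5):
    print(f"d = {d}: power = {approx_power(d, 66, 72):.2f}")
```

A stricter alpha, such as a corrected cutoff for multiple comparisons, lowers the power for the same effect size and sample.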

The majority of the findings of group differences in literacy 
skills before and at school age in English and Dutch have sup- 
ported the idea of continuity of familial risk (Boets et al., 2010; 
Pennington & Lefly, 2001; Snowling et al., 2003, 2007, 2008; van 
Bergen et al., 2011, 2012), although signs of diminishing group 
differences along with age have been reported (Boets et al., 2010; 
Snowling et al., 2007; Torppa et al., 2010; van Bergen et al., 2011). 
Because in our study, the NoDys_FR group and the control group 
differed from each other only in one task, no firm conclusions of 
diminishing versus expanding group differences could be made. 

The most prominent support for the continuous nature of familial
risk comes from studies employing tasks that rely heavily on accurate
grapheme-to-phoneme decoding, that is, pseudoword or nonword
reading accuracy (Boets et al., 2010; Pennington & Lefly, 2001;
Snowling et al., 2003, 2007; van Bergen et al., 2011). No differences
between the two groups with or without familial risk and no dyslexia,
on the other hand, have been reported in tasks where processes other
than phonological processing could be used instead of, or in support
of, phonological decoding, that is, in word reading (Boets et al.,
2010; Snowling et al., 2003; van Bergen et al., 2011) and reading
comprehension in adolescence (Snowling et al., 2007). Nor have
differences been reported in reading tasks without time pressure
(i.e., untimed nonword reading accuracy; Snowling et al., 2007) or
where the orthography of the language has been extremely transparent,
as in Finnish (Torppa et al., 2010). Therefore, it seems reasonable
to think that the
requirements of the task or the transparency of orthography or both 
might affect the visibility of the continuity of familial risk. Finnish 
is at the shallowest end of the orthographic depth continuum, with
close one-to-one correspondence between graphemes and pho- 
nemes. This high correspondence makes the learning of decoding 
and foundation level reading easy (Seymour, Aro, & Erskine, 
2003), and most of the children, even those with familial risk, can 
learn accurate decoding by the end of Grade 2. In English, on the 
other hand, where nonword reading has been found to be poorer 
than in more transparent German (Frith et al., 1998), the complex- 
ity and inconsistencies of orthography could bring out differences 
between groups. 

The discrepant results concerning the continuous nature of fa- 
milial risk can also be a consequence of differences in classifying 
children with or without dyslexia. Whereas in our study, we based 
the classification on reading speed, reading accuracy, and spelling, 
Snowling et al. (2003, 2007) based their classification on a com- 
posite score that included reading comprehension in addition to 
word reading and spelling accuracy. It is thus possible that slow 
readers with good comprehension skills, a group shown to be 
present at least in the Finnish sample (Torppa et al., 2007), might 
have ended up in the nondyslexia group. Likewise, van Bergen et 


al. (2011) based their classification of children solely on fluency. 
That is, they did not take reading or spelling inaccuracy as criteria. 
So, it is also possible that the group of at-risk nondyslexic children 
in the Dutch sample included children with difficulties in accuracy 
but not in fluency. This possibility is supported by the finding that 
in that study, children with typical reading skills but with the 
familial risk differed from control children only in pseudoword 
reading in Grade 5 (van Bergen et al., 2011), a task that relies 
heavily on accurate grapheme-phoneme decoding ability. Interest- 
ingly, in another Dutch-speaking sample, the family-risk nondys- 
lexia and control children were more similar to each other when 
the classification was based on word reading fluency, word reading 
accuracy, and spelling accuracy (Boets et al., 2010). To further 
explore this question, researchers should re-analyze the existing 
data sets applying uniform criteria in classification of children into 
subgroups. 

To better understand the slow reading speed in our Dys_FR 
group, we compared reading speed in different tasks across groups. 
In Grades 2 and 3, children in the Dys_FR group read pseudoword 
texts and word lists at equal speeds, whereas the two other groups 
read word lists about 1-2 letters/second faster than pseudoword 
texts. This pattern suggests at least two possible interpretations.
First, it might suggest that children in the Dys_FR group used the 
same processes in word and pseudoword reading, relying mainly 
on letter-by-letter decoding. This conclusion is in line with the 
findings of Ziegler et al. (2003) regarding English- and German- 
speaking children with dyslexia. In orthographically transparent 
Italian, Zoccolotti et al. (2005) found that children
with dyslexia showed a clear word-length effect in word reading,
which suggests that the children were still using a sublexical
reading procedure in Grade 3. However, Barca, Burani, Di Filippo,
and Zoccolotti (2006) reported in an Italian sample that by
Grade 6 lexical reading appeared to be available even for children 
with dyslexia. In the sample of our study, a similar addition of 
lexical reading process seems to have taken place by Grade 8: 
children in the Dys_FR group read word lists about 2.5 letters/ 
second faster than pseudoword texts, similar to findings for the 
children in the two other groups, albeit at an overall lower
speed. Second, the slower reading speed of the Dys_FR group in
word list reading compared with the other two groups can be
a consequence not only of poor decoding skills but also of diffi- 
culties in the use of orthographic lexicon, as suggested by Berg- 
mann and Wimmer (2008). These difficulties could result from 
their lower level of exposure to printed text and as a consequence 
less familiarity with the presented words; word frequency has been 
shown to have a strong effect on word recognition speed already in 
school-age children (Zoccolotti et al., 2009). Children in the 
NoDys_FR and control groups seemed able to take advantage of
their orthographic lexicon and recognize at least the most frequent
and therefore familiar words by sight already in Grade 2. This is in
line with the findings in another orthographically transparent
language, Italian, where a lexicality effect was already present in
children for high-frequency words at the end of Grade 1 and for
low-frequency words in Grade 3 (Zoccolotti et al., 2009).

In Grade 2, a similar kind of developmental lag seemed to be 
present also in the ability of children in the Dys_FR group to use 
contextual cues, such as syntactic and semantic information: the 
difference in reading speed between word lists and texts was 
smaller in the Dys_FR group than in the NoDys_FR and control 
groups. At the end of Grade 2, decoding is still difficult in the
Dys_FR group, and therefore fewer cognitive resources are left for
language processing (Kim et al., 2012; Perfetti, 1985) in children
with dyslexia. In Grades 3 and 8, all groups read texts approximately
2 letters/second faster than word lists, a result that is in line
with the earlier findings in which the same words in context were 
read faster than without context by fourth graders (Jenkins et al., 
2003). Children with dyslexia were beginning to utilize contextual 
cues from Grade 3, at least 1 year later than normally developing 
children. And finally, the smaller difference throughout the 
follow-up period in reading speed between text and pseudoword 
text reading in the Dys_FR group suggests long-standing deficien- 
cies in automatization of decoding in familiar words, as suggested 
by Share (2008), or deficient use of word and subword level 
representations and contextual cues (Snowling, 2008; Stanovich, 
1980), or both. Methodological limitations, such as more than one 
varying factor in comparisons between the tasks, prevent us from 
making firm conclusions about the processes used in different 
tasks and to what extent they are specifically compromised in the 
Dys_FR group. 

In conclusion, the findings of the current longitudinal study 
confirm that the literacy difficulties of children with familial risk 
for dyslexia and dyslexia in Grade 2 are often persistent. On the 
other hand, in spite of the familial risk, children who have acquired 
the basic reading skills follow, for the most part, the developmen- 
tal track of children without reading difficulties or familial risk 
later on. In other words, it appears, at least on the group level, that 
if there are no signs of reading difficulties in Grade 2, one can 
anticipate typical literacy development also in later grades. But is
this true also at the individual level? Do the age-appropriate
literacy skills shown here guarantee that these children with fa- 
milial risk of dyslexia also have age-appropriate reading compre- 
hension skills later, as shown by Snowling et al. (2007) with 
English-speaking children? These remain important questions for 
future studies. 
Parallel and Serial Reading Processes in Children’s
Word and Nonword Reading 


Madelon van den Boer and Peter F. de Jong 
University of Amsterdam 


Fluent reading is characterized by rapid and accurate identification of words. It is commonly accepted 
that such identification relies on the availability of orthographic knowledge. However, whether this 
orthographic knowledge should be seen as an accumulation of word-specific knowledge in a lexicon 
acquired through decoding or as a well-developed associative network of sublexical units is still under 
debate. We studied this key issue in reading research by looking at the serial and/or parallel reading 
processes underlying word and nonword reading. Participants were 314 Dutch 2nd, 3rd, and 5th graders.
The children were administered digit, word, and nonword naming tasks. We used latent class analyses 
to distinguish between readers who processed the letter strings serially or in parallel, based on the 
correlation patterns of word and nonword reading with serial and discrete digit naming. The 2 classes of 
readers were distinguished for both word and nonword reading. The validity of these classes was 
supported by differences in sensitivity to word and nonword length. Interestingly, the different classes 
seemed to reflect a developmental shift from reading all letter strings serially toward parallel processing 
of words, and later of nonwords. The results are not fully in line with current theories on the 
representation of orthographic knowledge. Implications in terms of models of the reading process are 
discussed.

Fluent reading is characterized by rapid and accurate identification
of words. Such identification is commonly believed to
depend on the availability of orthographic knowledge (e.g., Ehri, 
2005; Share, 2008). However, the proper representation of ortho- 
graphic knowledge in a model of reading is still under debate. On 
the one hand, it has been proposed that readers acquire word- 
specific knowledge and store this knowledge in a lexicon (e.g., 
Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Jackson & 
Coltheart, 2001). Upon encountering familiar words in written 
form, pronunciation and meaning can immediately and automati- 
cally be retrieved from memory (Ehri, 2005). On the other hand, it 
has been proposed that the reading system is an associative net- 
work of interconnected sublexical units, without lexical memory 
for words (e.g., Plaut, McClelland, Seidenberg, & Patterson, 1996; 
Seidenberg & McClelland, 1989). First, we discuss these two 
approaches and their implications in more detail. Next, we con- 
sider methods to determine whether word identification is based on 
the retrieval of pronunciations from memory. 

According to the first or word-specific approach, fluent reading 
means reading by sight. For a word to be read by sight, a connection
must be made between the orthographic form of a word and its
previously acquired phonological counterpart (Ehri, 2005). Ac- 
cording to the self-teaching hypothesis (Share, 1995, 1999), a 
reader can acquire the detailed orthographic representations nec- 
essary for fast and efficient reading through phonological recoding 
of novel letter strings. Every time a reader successfully decodes a 
printed word into a phonological code, an orthographic represen- 
tation of that word is built or strengthened. Therefore, beginning 
readers initially rely on decoding skills to read words, but read 
more fluently when previous encounters with words have accu- 
mulated in well-established orthographic representations. 

This development of the reading system, from heavy reliance on 
decoding toward reading an increasing number of words by sight, 
fits well with the dual route cascaded (DRC) model of reading 
(Coltheart et al., 2001; Jackson & Coltheart, 2001). Therefore, the 
DRC model provides a useful framework in studying reading 
development, although it should be noted that the model is in- 
tended to model reading aloud of monosyllabic letter strings by 
adult fluent readers. Within the DRC model, two routes are dis- 
tinguished that are simultaneously active. Initial parallel identifi- 
cation of letter identities is common to both routes. Subsequently, 
phonology is activated through the lexical and nonlexical routes. 
Sight word reading is represented as reading through the lexical 
route. In the lexical route, word identification is achieved in 
parallel by successive activation of the word’s entry in the ortho- 
graphic and phonological lexicon. Decoding, dominating the pro- 
cessing of unfamiliar words or nonwords, is modeled with the 
nonlexical route. This route works in parallel to the lexical route, 
but graphemes are serially decoded into phonemes according to 
grapheme-phoneme conversion rules. As a result of reading expe- 
rience, one could expect a gradual shift in dominance from the 
nonlexical route, when many words are decoded early in development,
toward the lexical route, when an increasing number of
words become represented in the orthographic lexicon and can be 
quickly recognized by sight. 
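
The division of labor between the two routes can be illustrated with a toy sketch. It is not the DRC model itself, which is a cascaded activation model in which both routes are active at the same time; here the nonlexical route is simply used as a fallback. The lexicon entries, the one-letter-per-phoneme rules, and all function names are hypothetical simplifications.

```python
# Toy orthographic lexicon: familiar spellings mapped to stored phoneme sequences.
LEXICON = {
    "kat": ("k", "a", "t"),
    "boom": ("b", "oo", "m"),
}

# Toy grapheme-to-phoneme rules, one letter at a time. Real rule sets are
# larger and handle multi-letter graphemes.
GPC_RULES = {letter: letter for letter in "abdeklmnoprst"}


def lexical_route(letter_string):
    """Retrieve a stored pronunciation in one step, if an entry exists."""
    return LEXICON.get(letter_string)


def nonlexical_route(letter_string):
    """Convert graphemes to phonemes serially, from left to right."""
    phonemes = []
    for letter in letter_string:
        phonemes.append(GPC_RULES.get(letter, "?"))
    return tuple(phonemes)


def read_aloud(letter_string):
    """Return (phonemes, route, processing steps) for a letter string."""
    stored = lexical_route(letter_string)
    if stored is not None:
        return stored, "lexical", 1
    decoded = nonlexical_route(letter_string)
    return decoded, "nonlexical", len(decoded)


print(read_aloud("kat"))  # familiar word: one retrieval step
print(read_aloud("tak"))  # unfamiliar string: one step per letter
```

In this toy version, the number of steps is constant for items in the lexicon and grows with length for all other items, which is the contrast the length-effect literature builds on.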

An important characteristic of the DRC model is that words can 
only be read by sight if a word-specific representation is present in 
the orthographic lexicon (e.g., Coltheart et al., 2001). In other 
words, reading development is item specific. Orthographic repre- 
sentations can exist only if words have previously been encoun- 
tered and decoded successfully. And words can be processed in 
parallel only if orthographic representations exist that are con- 
nected to the representations of the same words in the phonological 
lexicon. 

This idea of word-specific orthographic knowledge, however, 
stands in sharp contrast to the second approach in modeling the 
reading system. According to the parallel distributed processing 
model (PDP; e.g., Plaut et al., 1996), for example, word-specific 
representations do not exist. Rather, letter strings are read by a 
reading system based on parallel activation of interconnected 
orthographic, phonological, and semantic units. The interactions 
among these units are governed by connection weights that repre- 
sent the system’s knowledge of spelling—sound correspondences in 
the language input. Within this associative network of sublexical 
units, there is no orthographic or phonological lexicon for words. 

As a result of the different representations of orthographic 
knowledge as either word-specific or sublexical, the DRC and PDP 
models of reading also have different definitions of fluent reading. 
Fluent reading, in the DRC model (e.g., Coltheart et al., 2001), 
entails reading by sight, which occurs through parallel activation 
of phonology of a letter string by accessing representations in the 
orthographic and phonological lexicon. In contrast, fluent reading 
in the PDP model (e.g., Plaut et al., 1996) entails parallel activation 
of phonology from print via sublexical units. Both models, how- 
ever, predict that fluent word reading is achieved through parallel 
computation of phonology from the letter string. 

The models differ greatly in how they account for the reading of 
nonwords. According to the DRC model (e.g., Coltheart et al., 
2001), nonwords cannot be represented in the orthographic lexi- 
con, and as a result always require involvement of the nonlexical 
route. In contrast, PDP models do not presume a separate mech- 
anism for the reading of unfamiliar words and nonwords. Accord- 
ing to the PDP model (e.g., Plaut et al., 1996), all letter strings are 
read by the same reading system through parallel activation of the 
interconnected units. Nonwords, especially those that adhere to 
regular orthographic and phonological patterns, are not processed 
differently from words. 

A key issue in distinguishing between these two models of 
reading is whether phonological codes of words and nonwords are 
activated in parallel. Within the DRC framework, length effects 
have been studied as indicators of whether phonology is activated 
predominantly serially or in parallel. In the early stages of reading 
development, the time needed to read single words and nonwords
increases as a function of the number of letters, whereas in advanced
readers this length effect becomes restricted to longer
words (i.e., more than six letters) and nonwords (e.g., Marinus & 
de Jong, 2010; Spinelli et al., 2005; van den Boer, de Jong, & 
Haentjens-van Meeteren, 2013; Weekes, 1997; Ziegler, Perry, 
Ma-Wyatt, Ladner, & Schulte-Körne, 2003; Zoccolotti et al.,
2005). A length effect is presumed to occur when words are 


identified through serial activation of phonology, whereas the 
absence of a length effect indicates that phonology is activated in 
parallel. In line with the DRC model, length effects remain for 
nonwords, which are supposed to be read predominantly through 
the nonlexical route. 
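
A length effect of this kind is commonly summarized as the slope of naming latency on item length. The sketch below estimates that slope by ordinary least squares. The latencies are hypothetical illustration values, not data from any of the cited studies, and the function name is an assumption.

```python
def length_effect_slope(lengths, latencies_ms):
    """Least-squares slope of naming latency on number of letters (ms per letter)."""
    n = len(lengths)
    if n != len(latencies_ms) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_len = sum(lengths) / n
    mean_lat = sum(latencies_ms) / n
    sxx = sum((l - mean_len) ** 2 for l in lengths)
    if sxx == 0:
        raise ValueError("item lengths must vary")
    sxy = sum((l - mean_len) * (t - mean_lat) for l, t in zip(lengths, latencies_ms))
    return sxy / sxx


# Hypothetical mean naming latencies (ms) for items of 3 to 6 letters.
lengths = [3, 4, 5, 6]
beginner = [900, 1050, 1200, 1350]  # latency grows with each extra letter
advanced = [600, 605, 598, 603]     # latency is nearly flat across lengths

print(length_effect_slope(lengths, beginner))  # 150.0
print(round(length_effect_slope(lengths, advanced), 1))
```

A steep positive slope is the pattern expected under serial processing, whereas a slope near zero is the pattern expected when phonology is activated in parallel.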

However, although a length effect is indeed expected when 
letter strings are decoded, the reverse—that an observed length 
effect is the result of decoding—is not necessarily true. In fact, 
length effects have been found that could not be ascribed to serial 
processing through the nonlexical route (Risko, Lanthier, & 
Besner, 2011; van den Boer, de Jong, & Haentjens-van Meeteren, 
2012). Risko et al. (2011), for example, found that increased 
spacing between letters resulted in increased effects of item length. 
These effects, however, were found at the level of letter identifi- 
cation, not serial activation of phonology. Similarly, Van den Boer 
et al. (2012) found length effects in the lexical decisions of 
children, while independent evidence suggested that items were 
processed in parallel through the lexical route. Together, these 
findings indicate that a length effect in and of itself does not prove 
that serial processes underlie word identification. Moreover, in 
PDP models, length effects are not interpreted in terms of decoding 
but are ascribed to other factors, such as visual and articulatory 
factors or differences in orthographic neighborhood size (Seiden- 
berg & Plaut, 1998; but see Plaut, 1999, for an attempt to model 
length effects within a PDP framework). Thus, length effects are 
expected when words are decoded, but a length effect in itself does 
not prove that words have been identified through serial decoding. 
Additional independent evidence for a serial or parallel reading 
strategy is called for. 

As an alternative, it has been proposed that parallel processing 
can be detected by the speed with which single words are read. 
Ehri and Wilce (1983) compared how quickly beginning readers 
could identify highly familiar, overlearned symbols (i.e., digits) 
with the readers’ word recognition speed. In skilled readers, 
response latencies to both digits and words were equal as early 
as first grade. In less skilled readers, however, similar response 
rates were obtained later, around third or fourth grade. These 
results indicated that even in the first years of reading devel- 
opment, words are no longer decoded but are processed in 
parallel and elicit the same routinized naming responses as 
overlearned symbols. Interestingly, Ehri and Wilce (1983) also 
included three-letter nonwords in their study and found that 
skilled readers also identified these nonwords as quickly as 
digits. Less skilled readers, however, identified nonwords
more slowly than digits at least up to fourth grade. These findings
suggest that nonword phonology could potentially be activated 
in parallel. 

In line with Ehri and Wilce (1983), Aaron et al. (1999) showed
that if a word is processed in parallel, the speed of reading this 
word is close to the speed of naming letters. Similar results have 
also been reported by van den Bos, Zijlstra, and Van den Broeck 
(2003), who showed that naming speed of alphanumeric symbols 
(i.e., letters and digits) was closely related to monosyllabic word 
naming speed. However, naming speed is greatly influenced by 
word frequency (e.g., Forster & Chambers, 1973; Frederiksen & 
Kroll, 1976). The phonological codes of digits are very frequent, 
which results in relatively short naming latencies. Therefore, sim- 
ilar reading latencies to digits and words are probably only found 
when high frequency words are studied. Reading latencies to 
words of lower frequency might not be equal to reading latencies
of digits, even though these words might be processed in parallel. 

To get around this problem, de Jong (2011) argued that if word 
reading relies on a parallel retrieval process, individual differences 
in word reading and digit naming speed should be similar. There- 
fore, a high correlation should be found between word and digit 
naming, despite possible differences in absolute naming speed. 
More specifically, de Jong proposed to consider the relations of 
serial and discrete digit naming with word reading to determine 
whether a particular set of words is read by sight. Digit naming 
concerns the rapid naming of digits. Whereas in serial naming the 
digits are presented in rows, in discrete naming digits are presented 
one by one, on a computer screen. Naming latencies of digits 
presented in a discrete format were assumed to reflect lexical 
access speed, the retrieval of known phonological codes from 
memory. If words, also presented in a discrete format, are pro- 
cessed in parallel, a high correlation is expected with discrete digit 
naming. If, however, words are read through decoding, a stronger 
relation could be expected with a serial format of digit naming, 
because both activation of phonology and serial digit naming 
reflect a serial process. The correlation patterns were in line with 
both of these expectations in showing that for beginning readers 
(Grade 1), who are expected to rely predominantly on decoding, 
word reading was most strongly related to serial digit naming, 
whereas discrete digit naming was the stronger correlate for more 
advanced readers, who are expected to process short words in
parallel (Grades 2 and 4). 

As a next step, de Jong (2011) showed through latent class 
analyses that the children from the three grades could be assigned 
to two classes of readers. For a large class of readers, single word 
reading related strongly to discrete digit naming. For a second, 
smaller class of readers, however, single word reading related 
more strongly to serial digit naming. This suggested that the first 
class of readers processed the words in parallel, similar to naming 
a digit. The second class of readers, however, was not processing 
the words in parallel but predominantly relied on a serial decoding 
strategy instead. De Jong argued that this classification is fully 
compatible with an item-specific view of reading development, 
such as the DRC model (e.g., Coltheart et al., 2001). Whether a 
reader processed the words in parallel or serially depended on 
whether the words in the set were represented in the lexicon or not. 
If the words were represented in the lexicon, the words were read 
by sight. If the majority of the words in the set were not repre- 
sented in the lexicon, the main reading strategy would be serial 
decoding. In other words, the classifications depend on the words 
that were presented. The number of classes would vary with the 
number of word sets used, and the sizes of the classes with the 
difficulty of the words included. 

In the current study we focused on word and nonword reading 
in Grades 2, 3, and 5. For word reading, we expected to find two 
classes of readers, namely, serial and parallel processors. More 
importantly, we studied whether these results are tied to a partic- 
ular set of words by studying whether similar classes of readers 
could be distinguished for nonword reading. According to an 
item-specific view of reading development, and in line with the 
DRC model, all readers should have a predominantly serial reading 
strategy for nonwords; thus, only one class of serial nonword 
readers should be identified. These predictions are tested against 
the predictions of the PDP model (e.g., Plaut et al., 1996), which
states that both words and nonwords are processed in parallel by
all readers. Thus, a single class of parallel processors would be 
expected for both word and nonword reading. If nonwords, like 
words, can be processed in parallel, this would indicate that serial 
and parallel reading processes were not tied to particular sets of 
words but could potentially be generalized to all short words and 
nonwords. A second novel aspect in the current study is the focus 
on validating the interpretation of the different classes of readers 
by examining length effects. Reading latencies of serial processors 
are expected to be affected by word length, whereas the reading 
latencies of parallel processors are hypothesized to be independent 
of length. 


Method 


Participants 


A total of 314 Dutch children participated in the study. One 
hundred seventeen children attended second grade (52 boys, 65 
girls), 86 third grade (44 boys, 42 girls), and 111 fifth grade (51 
boys, 60 girls). The mean ages of the children were 8 years (SD = 
5.70 months) in Grade 2, 9 years 4 months (SD = 6.58 months) in 
Grade 3, and 11 years (SD = 5.86 months) in Grade 5. All children 
attended mainstream primary education. Scores on the One Minute 
Reading Test (Eén Minuut Test; Brus & Voeten, 1995), a stan- 
dardized test of word reading fluency with an average of 10 and a 
standard deviation of 3, showed that the sample included a repre- 
sentative range of reading abilities (Grade 2: M = 10.66, SD = 
2.93; Grade 3: M = 10.54, SD = 2.52; Grade 5: M = 9.25, SD = 
2.58). All children had normal or corrected to normal vision. 


Measures 


A word and nonword reading task was administered to all 
children, as well as serial and discrete measures of digit naming. 

Discrete word and nonword reading. The reading task con- 
sisted of 45 words and 45 nonwords varying in length from three 
to five letters. For each length, 15 monosyllabic words were 
selected from a corpus of child literature of two million tokens 
(Schrooten & Vermeer, 1994). Across lengths, words were 
matched on onset (i.e., the first letter) and frequency. The words 
ranged in frequency to reflect the variation in words children 
encounter (Mdn = 23, range: 1-148). Nonwords were created by 
interchanging onsets and rhymes of the words. For example, the 
words drift, front, and kramp (meaning urge, front, and cramp, 
respectively) were used to create the nonwords dront, framp, and 
krift. Therefore, words and nonwords were matched on onset and 
consonant-vowel structure. When the created nonword was unpronounceable
or also a Dutch word, one letter was changed in the
rhyme. 
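
The onset and rhyme exchange described above can be illustrated with a minimal sketch. It assumes that the onset is the initial consonant cluster and that each onset is paired with the rhyme of the next word in the list; the function names are hypothetical, and the manual adjustment of unpronounceable or existing items is not modeled.

```python
VOWELS = set("aeiou")


def split_onset_rhyme(word):
    """Split a monosyllabic word into onset (initial consonants) and rhyme (the rest)."""
    for i, letter in enumerate(word):
        if letter in VOWELS:
            return word[:i], word[i:]
    raise ValueError(f"no vowel found in {word!r}")


def make_nonwords(words):
    """Pair each onset with the rhyme of the next word, wrapping around at the end."""
    parts = [split_onset_rhyme(w) for w in words]
    onsets = [onset for onset, _ in parts]
    rhymes = [rhyme for _, rhyme in parts]
    rotated = rhymes[1:] + rhymes[:1]
    return [onset + rhyme for onset, rhyme in zip(onsets, rotated)]


# The example items from the text: drift, front, kramp.
nonwords = make_nonwords(["drift", "front", "kramp"])
print(nonwords)  # ['dront', 'framp', 'krift']
```

Because onsets stay in place and only rhymes move, each nonword keeps the onset and the consonant-vowel structure of the word it was derived from.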

The reading task (as well as the discrete digit-naming task 
described below) was programmed in E-prime (Version 1.0; Sch- 
neider, Eschman, & Zuccolotto, 2002). Words and nonwords were 
presented one by one in the middle of a laptop screen (14.1 in.; 
35.8 cm) in 72-point Arial font. A plus sign presented for 750 ms 
focused attention. Then the word or nonword appeared, and chil- 
dren were asked to read it aloud as quickly and accurately as 
possible. A voice key registered naming latencies from the onset of 
stimulus presentation until the onset of the response. The experimenter
registered naming accuracy on a response box (correct and
valid, incorrect, or invalid). Words and nonwords were presented 
in blocks, separated by a fixed break of 1.5 min. The order of word 
and nonword reading was counterbalanced across the children. 

Digit naming. Naming of digits (1, 3, 5, 6, and 8) was 
administered in serial and discrete format. 

Serial digit naming. The five digits were presented 10 times 
in a random order on a sheet with five lines of 10 digits each (see 
Denckla & Rudel, 1976). Children were asked to name aloud all 
digits as quickly as possible. The time needed to name all 50 digits 
was converted to the number of digits named per second. 

Discrete digit naming. The 50 digits were also presented in a 
discrete naming task, in the same order as in the serial task. The 
digits were presented one by one in the middle of a laptop screen 
(14.1 in.; 35.8 cm) in 72-point Arial font. Each trial started with a 
plus sign, presented for 750 ms, to focus attention. Then the digit 
was presented and remained on the screen until the child made a 
response. A voice key registered response latencies from the onset 
of presentation until the onset of the response. The experimenter 
registered naming accuracy on a response box (correct and valid, 
incorrect, or invalid). The score consisted of the mean naming 
latency per digit, converted to the number of digits named per 
second. 


Procedure 


Children in second and fifth grade were tested in January/ 
February, when they had received approximately 1 year 5 months 
and 4 years 5 months of reading instruction, respectively. Third 
graders were tested in June/July, after approximately 3 years of 
reading instruction, meaning that the reading age of these children 
lay exactly between the reading ages of second and fifth graders. 
The word and nonword reading task and the digit naming tasks 
were administered during two waves of more extensive data col- 
lection. Second and fifth graders participated in a classroom ses- 
sion of about 1 hr 30 min and two individual sessions of approx- 
imately 30 min each. Third graders completed the experimental 
tasks during one individual session of approximately 40 min. 


Results 


Clustering Readers Based on Reading Processes 


As is to be expected in a transparent orthography, mean accuracy
across grades was high for both words (M = 0.95, SD = 0.07) and
nonwords (M = 0.92, SD = 0.09). Reading latencies were excluded
from analysis if the voice key was not validly triggered
(5.9%), if latencies were less than 250 ms or more than 6,000 ms
(0.9%), and if latencies were more than 3 standard deviations from
a participant’s mean (1.6%). Similar to de Jong (2011), word and
nonword reading latencies were converted into fluency scores 
reflecting the number of items read correctly per second. First, 
average word and nonword latencies were calculated for each child 
and transformed to the number of items read per second to nor- 
malize scores. Then, the proportion of words and nonwords read 
correctly was calculated over valid trials. Finally, word and non- 
word fluency scores were calculated by multiplying the number of 
items read per second by the proportion of items correct. For 
clarity purposes, we use the terms word reading fluency and 
nonword reading fluency to refer to these reading scores. How- 
ever, please note that the measures of reading fluency are based on 
discrete word and nonword reading tasks. 
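
The scoring steps can be sketched for a single child and item type as follows. This is an illustration under stated assumptions rather than the scoring script itself: the function name is hypothetical, the mean latency is taken over all retained trials, the sample standard deviation is used for the 3 SD criterion, and the proportion correct is computed over the same retained trials.

```python
from statistics import mean, stdev


def fluency_score(latencies_ms, correct, valid):
    """Return items read correctly per second for one child and one item type.

    latencies_ms: naming latency per trial, in milliseconds
    correct:      True if the response was correct
    valid:        True if the voice key was validly triggered
    """
    # Drop invalid voice-key trials and latencies outside 250-6,000 ms.
    trials = [
        (rt, ok)
        for rt, ok, is_valid in zip(latencies_ms, correct, valid)
        if is_valid and 250 <= rt <= 6000
    ]
    # Drop latencies more than 3 SD from this child's own mean.
    rts = [rt for rt, _ in trials]
    m, sd = mean(rts), stdev(rts)
    trials = [(rt, ok) for rt, ok in trials if abs(rt - m) <= 3 * sd]

    # Mean latency (ms) converted to items per second.
    items_per_second = 1000 / mean(rt for rt, _ in trials)
    # Proportion correct over the retained trials (assumption, see lead-in).
    proportion_correct = mean(1.0 if ok else 0.0 for _, ok in trials)
    return items_per_second * proportion_correct


# Hypothetical child: four valid trials of 500 ms, three of them correct.
score = fluency_score([500, 500, 500, 500], [True, True, True, False], [True] * 4)
print(score)  # 1.5
```

In this hypothetical case the child reads two items per second and 75% of them correctly, which yields a fluency score of 1.5 items per second.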

Scores on word and nonword reading fluency, as well as on 
serial and discrete digit naming, were normally distributed in each 
grade separately and in the entire data set. All variables were 
inspected for univariate outliers (i.e., a score of more than 3 
standard deviations above or below the mean), separately for each 
grade. Two outliers (one in Grade 3 and one in Grade 5) were 
identified for word reading, one (in Grade 3) for nonword reading, 
two (one in Grade 3 and one in Grade 5) for serial digit naming, 
and two (one in Grade 3 and one in Grade 5) for discrete digit 
naming. These scores were coded as missing and not included in 
the analyses. None of the children was identified as a multivariate 
outlier. 

Descriptive statistics. The means and standard deviations on 
word and nonword reading fluency and serial and discrete digit 
naming for each grade are shown in Table 1. Overall, growth can 
be seen across grades. Both reading fluency and digit naming 
speed increased significantly between Grades 2 and 3. Between 
Grades 3 and 5, only discrete digit naming speed significantly 
increased. Across all grades, average reading fluency was lower 
than average digit naming speed. 

Correlations between word and nonword reading fluency and 
serial and discrete digit naming for each grade are shown in Table 
2. In Grade 2, word reading fluency correlated equally strongly 
with serial and discrete digit naming (Z = 0.701, p = .414). In 
Grades 3 and 5, however, word reading was more strongly related 
to discrete than to serial digit naming (Grade 3: Z = 2.216, p < 
.05; Grade 5: Z = 4.036, p < .001). Interestingly, a similar pattern 
was found in the correlations between nonword reading fluency 
and digit naming. In Grade 2, nonword reading fluency was related
equally strongly to serial and discrete digit naming (Z = 1.397,
p = .163). In contrast to words, equal relations were also found in
Grade 3 (Z = 1.024, p = .306). In Grade 5, however, the difference
of the correlation of nonword reading with discrete and serial
digit naming approached significance, in favor of discrete digit
naming (Z = 1.936, p = .053).


Table 1

Means (and Standard Deviations) on Word and Nonword Reading Fluency, and Serial and
Discrete Digit Naming in Items per Second in Grades 2, 3, and 5

                            Grade 2       Grade 3       Grade 5       t statistics
                            (N = 117)     (N = 86)      (N = 111)
Variable                    M (SD)        M (SD)        M (SD)        2 vs. 3        3 vs. 5
Word reading fluency        1.13 (.43)    1.62 (.25)    1.68 (.27)    10.284**       1.610
Nonword reading fluency      .99 (.43)    1.44 (.29)    1.46 (.34)     8.780**       0.463
Serial digit naming         1.75 (.39)    2.19 (.42)    2.27 (.45)    [illegible]    1.241
Discrete digit naming       1.69 (.26)    1.91 (.25)    2.00 (.31)     6.080**       2.275*


Table 2

Correlations of Word and Nonword Reading Fluency With Serial and Discrete Digit Naming in
Grades 2, 3, and 5

                 Words                                      Nonwords
Digit            Grade 2      Grade 3      Grade 5          Grade 2      Grade 3      Grade 5
naming           (N = 117)    (N = 86)     (N = 111)        (N = 117)    (N = 86)     (N = 111)
Serial           [illegible]  [illegible]  [illegible]      .564**       .388         [illegible]
Discrete         .467**       [illegible]  .643**           [illegible]  .444**       .437**
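
The text does not state which test produced the Z statistics for the difference between two correlations that share a variable. One common choice, shown here purely as an assumption, is the test of Meng, Rosenthal, and Rubin (1992) for correlated correlation coefficients. The function name and the input values are hypothetical; note that the test also requires the correlation between serial and discrete digit naming, which is not reported in this section.

```python
import math


def meng_z(r1, r2, r12, n):
    """Z for the difference between two dependent correlations.

    r1, r2: correlations of the shared variable (e.g., word reading) with
            each of the two predictors (e.g., discrete and serial naming)
    r12:    correlation between the two predictors
    n:      sample size
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transformation
    r_sq_mean = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - r_sq_mean)), 1.0)
    h = (1 - f * r_sq_mean) / (1 - r_sq_mean)
    return (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))


# Hypothetical values, not taken from the study.
z = meng_z(r1=0.60, r2=0.40, r12=0.50, n=100)
print(round(z, 2))
```

A positive Z indicates that the first correlation is the larger one; swapping the first two arguments only changes the sign.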

A series of stepwise regression analyses was conducted to
examine whether serial and discrete digit naming were independent
predictors of reading fluency. The analyses were conducted
for each grade, with word and nonword reading fluency as dependent
variables. In the first set of analyses serial digit naming was
entered first, and it was determined whether including discrete 
digit naming resulted in additional explained variance. In the 
second set, the order of serial and discrete digit naming was 
reversed. The (additional) variance explained in each step is pre- 
sented in Table 3. In Grade 2, both serial and discrete digit naming 
explained unique variance in word reading fluency. In Grades 3 
and 5, however, discrete digit naming was the stronger predictor, 
and serial digit naming did not explain additional variance. For 
nonword reading fluency, the results were the same, with the 
exception of a small independent effect of serial digit naming on 
nonword reading fluency in Grade 5. Interestingly, the results 
clearly show an increase in the amount of variance in reading 
fluency explained by discrete digit naming and a decrease in the 
amount of variance explained by serial digit naming. 
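
The logic of entering one predictor first and testing the additional variance explained by the second can be sketched with the standard two-predictor formula for R². The function names and the toy data are hypothetical and serve only to show how the two entry orders are compared.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r2_changes(y, first, second):
    """Return (R2 after step 1, additional R2 when `second` is added in step 2)."""
    r_y1, r_y2, r_12 = pearson(y, first), pearson(y, second), pearson(first, second)
    r2_step1 = r_y1 ** 2
    # R2 of the model with both predictors, expressed in correlations.
    r2_full = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return r2_step1, r2_full - r2_step1


# Toy data: two uncorrelated predictors that jointly determine the outcome.
serial = [1, -1, 1, -1]
discrete = [1, 1, -1, -1]
reading = [s + d for s, d in zip(serial, discrete)]

step1, change = r2_changes(reading, first=serial, second=discrete)
step1_rev, change_rev = r2_changes(reading, first=discrete, second=serial)
```

With uncorrelated predictors the order of entry does not matter. When the predictors are correlated, as serial and discrete digit naming are, the step 2 value shows the unique contribution of the predictor entered last.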

Classes of readers. The correlation patterns and regression 
results indicate that in the early stages of reading development 
(i.e., Grade 2) serial digit naming is the stronger correlate and 
predictor of reading fluency, whereas reading becomes more 
strongly related to discrete digit naming in the higher grades. This 
might suggest that two classes of readers could be found: one class 
for whom word reading is related more strongly to serial digit 
naming, and one class for whom word reading is related more
strongly to discrete digit naming. Alternatively, three classes of
readers could be expected, when readers are better classified by
grade. Therefore, both two- and three-class models were fitted and
compared. Correlation patterns with nonword reading fluency suggest
that similar clusters of children could be found based on the
relations between nonword reading and digit naming. Therefore,
the same models were estimated based on nonword reading, serial
digit naming, and discrete digit naming.¹


Table 3

R² Changes in Hierarchical Regression Analyses Using Serial
and Discrete Rapid Naming to Predict Word and Nonword
Reading Fluency

                 Words                                      Nonwords
Digit naming     Grade 2      Grade 3      Grade 5          Grade 2      Grade 3      Grade 5
1. Serial        .29          .08**        .04*             .32**        [illegible]  [illegible]
2. Discrete      .06**        .18**        .39**            [illegible]  [illegible]  .24**
1. Discrete      [illegible]  [illegible]  [illegible]      [illegible]  [illegible]  .30*
2. Serial        [illegible]  .00          .01              .16          .02          .04**

* p < .05. ** p < .01.

If a (categorical) variable is measured that can be the source of 
heterogeneity within a sample, this variable can be used to split 
participants into groups, and differences can be analyzed through 
multiple group analyses. If, however, the source of heterogeneity 
is hypothesized but unobserved, as are reading strategies in the 
current study, factor mixture modeling can be used to determine 
classes within a heterogeneous sample (Lubke & Muthén, 2005). 
Through factor mixture modeling, participants were clustered into 
unobserved (latent) classes based on mean scores on and correla- 
tions between a set of observed variables. Three variables were 
input for the current analyses: word or nonword reading, serial 
digit naming, and discrete digit naming. 

Models distinguishing between two classes and three classes 
were fitted using Mplus (Version 5.21; Muthén & Muthén, 2009). 
Several statistics can be obtained to evaluate model fit and decide 
on the number of classes. However, Nylund, Asparouhov, and 
Muthén (2007) showed that the Bayesian information criterion 
(BIC) and bootstrap likelihood ratio test (BLRT) should be fa- 
vored. Models with lower BIC values should be preferred. The 
BLRT p value indicates whether a model with k classes significantly
improves fit over a model with k − 1 classes. In addition,
entropy was used to evaluate the models, with a value close to 1 
indicating low average likelihoods that a child assigned to one 
class could have been assigned to another (Celeux & Soromenho, 
1996). 
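
As an illustration of these criteria, the sketch below computes the BIC from a log-likelihood and the relative entropy from posterior class probabilities, using their standard definitions. The function names and the posterior probabilities are hypothetical; only the two BIC values in the last lines are the ones reported for the word reading models.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion; lower values indicate a preferred model."""
    return -2 * log_likelihood + n_parameters * math.log(n_observations)


def relative_entropy(posteriors):
    """Relative entropy of a classification, between 0 and 1.

    posteriors: one list of class-membership probabilities per child,
                each list summing to 1.
    """
    n, k = len(posteriors), len(posteriors[0])
    uncertainty = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1 - uncertainty / (n * math.log(k))


# Hypothetical posteriors: every child is assigned with certainty.
certain = relative_entropy([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# Hypothetical posteriors: assignment is no better than chance.
uncertain = relative_entropy([[0.5, 0.5], [0.5, 0.5]])

# BIC values reported for the word reading models.
bic_values = {2: 633.19, 3: 648.42}
preferred = min(bic_values, key=bic_values.get)
print(preferred)  # 2
```

An entropy of 1 corresponds to perfectly separated classes and an entropy of 0 to assignments that carry no information.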

For the word reading fluency models, the two-class model was 
favored over the three-class model, according to BIC (two classes: 
633.19; three classes: 648.42) and entropy (two classes: .878; three 
classes: .756). In addition, the BLRT indicated that the two-class 
model fitted significantly better than a one-class model (p < .001), 
but that a three-class model did not significantly improve fit over 
a two-class model (p = .92). The results of the two-class solution 
are presented in Table 4. For a large class of 277 children, word
reading correlated more strongly with discrete than with serial
digit naming, suggesting that words are processed in parallel or
read by sight. Children from each grade were assigned to this class
of parallel processors (83 second, 84 third, and 110 fifth graders).
However, for a smaller class of 37 children, word reading was
most strongly related to serial digit naming, suggesting that words
are not (yet) processed in parallel. This class of serial processors
consisted mainly of children in Grade 2 (34 second graders, 2 third
graders, and 1 fifth grader).

¹ Including word and nonword reading in one mixture model resulted in
classes that did not fit expected correlation patterns and were difficult to
interpret. This is possibly due to the high correlation between word and
nonword reading.


Table 4

Correlations of Serial and Discrete Digit Naming With Word
and Nonword Reading Fluency in Classes of Readers

                 Word reading fluency           Nonword reading fluency
                 Serial        Parallel         Serial        Parallel
Digit            processors    processors       processors    processors
naming           (N = 37)      (N = 277)        (N = 69)      (N = 245)
Serial           [illegible]   .438*            .545*         .403*
Discrete         .462*         .674*            .483*         [illegible]

For the nonword reading fluency models, the two-class model 
was favored over the three-class model according to BIC (two 
classes: 690.22; three classes: 704.49) but not according to entropy 
(two classes: .775; three classes: .836). The BLRT, however, 
indicated that the two-class model fitted significantly better than a 
one-class model (p < .001), but that a three-class model did not 
significantly improve fit over a two-class model (p = .89). More- 
over, one of the classes in the three-class solution included only 
nine children, and the interrelations among the variables within the 
classes were difficult to interpret. Therefore, the two-class solution 
seemed best. The results of the two-class solution are presented in 
Table 4. In line with the result for word reading, nonword reading 
correlated more strongly with discrete than serial digit naming for 
a large class of 245 children, suggesting that nonwords were 
processed in parallel. Parallel processors were identified in each 
grade (66 second, 79 third, and 100 fifth graders). For a smaller 
class of 69 children, nonword reading related more strongly to 
serial digit naming. This class of serial processors, who did not 
(yet) process nonwords in parallel, consisted mainly of children in 
Grade 2, although small groups of third and fifth graders were also 
assigned to this class (51 second graders, 7 third graders, and 11 
fifth graders). 


Length Effects 


If our interpretation of the classes of readers is correct, differ- 
ences would be expected across classes in length effects. Length 
effects are expected when letter strings are processed serially. 
Therefore, length effects were expected in the classes of readers 
who process words or nonwords serially, but not in the classes of 
readers who process words or nonwords in parallel. Accuracy rates 
and correct reading latencies to words and nonwords of three, four 
and five letters are presented in Table 5. 

Multilevel models were used to test differences in length effects 
(Snijders & Bosker, 1999). Within a multilevel model, random 
factors from participants and items can be captured within one 
model, instead of separate analyses (Quené & van den Bergh, 
2004). Each response to an item (Level 1) represents one case, but 
these cases are nested under individuals (Level 2). These models 
are equivalent to, for instance, the repeated measures analysis of 
variance but have more statistical power, because analyses are 
based on responses to all separate items instead of a mean score 
per participant per condition. 

The analyses were conducted with MLwiN 2.24 (Rasbash, 
Steele, Browne, & Goldstein, 2008). Separate models for words 
and nonwords were specified. In each model dummy variables for 
each length (three, four, or five letters) by class (serial, parallel 
processors) combination were computed, amounting to a total of 
six variables. To test the interactions of class and length as well as 
the main effect of length, length effects were split in two contrasts. 
These contrasts specified the differences between three versus four 
and four versus five letter items. The contrasts were tested simul- 
taneously in a multivariate test, using a chi-square statistic with 
two degrees of freedom (Tabachnick & Fidell, 2001). The main 
effects of class were tested with a single contrast, resulting in a 
chi-square statistic with one degree of freedom. 
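
A contrast test of this kind can be sketched as a Wald chi-square computed from the estimated coefficients and their covariance matrix. The sketch below is an assumption about the general form of such a test, not a reproduction of the MLwiN procedure; the function name, the estimates, and the covariance matrix are hypothetical.

```python
def wald_chi_square(estimates, cov, contrasts):
    """Wald chi-square for one or two linear contrasts tested simultaneously.

    estimates: coefficient estimates (e.g., one per length within a class)
    cov:       covariance matrix of the estimates (list of lists)
    contrasts: list of one or two contrast vectors; df equals their number
    """
    n = len(estimates)
    # Contrast values: C b
    cb = [sum(c[i] * estimates[i] for i in range(n)) for c in contrasts]

    def quad(c1, c2):
        # One element of C V C'
        return sum(c1[i] * cov[i][j] * c2[j] for i in range(n) for j in range(n))

    m = [[quad(a, b) for b in contrasts] for a in contrasts]
    if len(contrasts) == 1:
        return cb[0] ** 2 / m[0][0]
    if len(contrasts) == 2:
        det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
        # Quadratic form with the explicit inverse of a symmetric 2 x 2 matrix.
        return (m[1][1] * cb[0] ** 2 - 2 * m[0][1] * cb[0] * cb[1] + m[0][0] * cb[1] ** 2) / det
    raise ValueError("this sketch handles one or two contrasts only")


# Hypothetical estimates for 3-, 4-, and 5-letter items in one class.
estimates = [0.0, 1.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
length_contrasts = [[1, -1, 0], [0, 1, -1]]  # 3 vs. 4 letters, 4 vs. 5 letters

chi2_length = wald_chi_square(estimates, identity, length_contrasts)  # df = 2
```

The two length contrasts are evaluated jointly, so a single statistic with two degrees of freedom summarizes the effect of length.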

First, a model was specified for accuracy rates. Because accuracy
was dummy coded (0 is incorrect, 1 is correct), a logistic
regression procedure was used, assuming a binomial distribution
rather than the normal distribution assumed for reaction latencies.
Mean accuracy rates were high for both words (M = 0.95, SD =
0.07) and nonwords (M = 0.92, SD = 0.09). However, serial
processors were significantly less accurate than parallel processors
for both words, χ²(1) = 117.12, p < .001, and nonwords, χ²(1) =
135.03, p < .001. Length effects were found in the accuracy rates
of both classes for both words (serial processors: χ²(2) = 19.72,
p < .001; parallel processors: χ²(2) = 19.06, p < .001) and
nonwords (serial processors: χ²(2) = 48.40, p < .001; parallel
processors: χ²(2) = 58.15, p < .001). These length effects did not
differ significantly between classes. The effects could mainly be
ascribed to three-letter words and nonwords, which were read
more accurately than both four- and five-letter items.


Table 5

Accuracy Rates and Reading Latencies (and Standard Deviations) for 3-, 4-, and 5-Letter Words
and Nonwords in Serial and Parallel Processors

             Words                                             Nonwords
             Serial processors       Parallel processors      Serial processors       Parallel processors
             (N = 37)                (N = 277)                (N = 69)                (N = 245)
Length       Acc.       RT           Acc.       RT            Acc.       RT           Acc.       RT
3 letters    .89 (.10)  1,160 (459)  .97 (.05)  607 (109)     .88 (.11)  1,206 (436)  .97 (.05)  638 (102)
4 letters    .78 (.17)  1,590 (704)  .95 (.08)  639 (138)     .77 (.16)  1,572 (671)  .93 (.09)  680 (131)
5 letters    .82 (.16)  1,952 (809)  .96 (.06)  669 (159)     .80 (.18)  1,649 (754)  .94 (.08)  702 (138)

Note. Acc. = accuracy; RT = reaction time.


The same model was specified for reading latencies. As can be 
seen in Table 5, large differences are found between classes in 
mean reading latencies. These differences in mean latencies might 
affect the interpretation of possible differences in length effects 
across classes. Significant differences can reflect absolute differ- 
ences in length effects but might also be merely proportional 
differences. Because we were interested in relative differences in 
the effect of length, we controlled for the differences in overall 
reading latencies by calculating within-subject z-scores (Faust, 
Balota, Spieler, & Ferraro, 1999). The subject’s overall mean 
reading latency was subtracted from every item’s reading latency. 
The difference was divided by the standard deviation of the sub- 
ject’s latency score distribution based on all 90 word and nonword 
items. 
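
The within-subject standardization can be sketched as follows. The function name and the latencies are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def within_subject_z(latencies):
    """Standardize one child's item latencies against that child's own mean and SD."""
    m, sd = mean(latencies), stdev(latencies)
    return [(rt - m) / sd for rt in latencies]


# Two hypothetical children: the second is twice as slow on every item.
fast_reader = within_subject_z([600, 700, 800])
slow_reader = within_subject_z([1200, 1400, 1600])
print(fast_reader)  # [-1.0, 0.0, 1.0]
print(slow_reader)  # [-1.0, 0.0, 1.0]
```

Both hypothetical children obtain the same z-scores, which shows why this transformation removes proportional differences in overall speed and leaves only relative differences in the effect of length.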

As expected, length effects for words were larger in serial
processors than in parallel processors, χ²(2) = 64.15, p < .001.
Unexpectedly, however, a separate test showed that the effect of
length was significant in the parallel processors, χ²(2) = 355.95,
p < .001. For nonwords, length effects were also larger in serial
processors than in parallel processors, χ²(2) = 32.69, p < .001.
Again, however, a significant length effect was also found for
parallel processors, χ²(2) = 283.36, p < .001.

Two additional analyses were conducted to control for age and 
neighborhood size, respectively. The classes of word and nonword 
parallel processors included more of the older children, whereas 
the majority of the serial processors were children from Grade 2. 
To determine whether the differences in length effects between 
classes could be ascribed to age, we conducted the same analyses 
including only second graders. These children were more equally 
divided over the classes (words: serial processors N = 34, parallel 
processors N = 83; nonwords: serial processors N = 51, parallel 
processors N = 66) and did not differ in age (words: 8 years 1
month versus 8 years; nonwords: 8 years 1 month versus 7 years
11 months). Nevertheless, the results in Grade 2 were the same as
for the entire group. Length effects were larger in serial than in
parallel processors (words: χ²(2) = 35.01, p < .001; nonwords:
χ²(2) = 22.20, p < .001). Again, length effects were found in both
classes of readers for both words (serial processors: χ²(2) =
157.75, p < .001; parallel processors: χ²(2) = 234.96, p < .001)
and nonwords (serial processors: χ²(2) = 184.90, p < .001;
parallel processors: χ²(2) = 110.00, p < .001). Thus, differences
in the length effects of serial and parallel processors cannot be 
ascribed to differences in age between the classes. 

According to the PDP model, length effects could be ascribed 
to orthographic neighborhood size (Seidenberg & Plaut, 1998). 
Therefore, neighborhood size was added to the model for read- 
ing latencies as a covariate. Because the distribution of neigh- 
borhood size was skewed, a log-transformation was used and 
neighborhood size was standardized. As a result the estimates for 
neighborhood size can be interpreted as beta-coefficients. Four 
dummy variables were specified and added to the models for 
words and nonwords; the effect of neighborhood size on words and 
on nonwords in each class separately. The effect of neighborhood 
size on words was significant only for the parallel processors, β =
−.06, χ²(1) = 11.19, p < .001. Words with a larger neighborhood
size were read faster than words with a smaller neighborhood size.
The effect of neighborhood size on nonwords was significant for
both serial processors, β = −.20, χ²(1) = 22.75, p < .01, and
parallel processors, β = −.18, χ²(1) = 67.62, p < .001. Nonwords
with a larger neighborhood size yielded shorter response latencies 
than nonwords with a smaller neighborhood size. Although length 
effects decreased when neighborhood size was controlled for, all 
length effects remained significant. Thus, length effects in word 
and nonword reading latencies could not (fully) be ascribed to 
neighborhood size. 
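
The preprocessing of the covariate described above (log-transformation followed by standardization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function name, the example counts, and the use of log(1 + n) to keep items with zero neighbors defined are all assumptions. The text specifies only that a log-transformation and standardization were applied.

```python
import math
import statistics

def standardized_log_neighborhood(sizes):
    """Log-transform neighborhood sizes, then convert them to z-scores."""
    # log(1 + n) is an assumption so that items with zero neighbors stay defined.
    logged = [math.log1p(n) for n in sizes]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1)
    return [(value - mean) / sd for value in logged]

# Hypothetical neighborhood counts for eight items (not the study's stimuli).
example_sizes = [0, 1, 2, 3, 5, 8, 13, 30]
z_scores = standardized_log_neighborhood(example_sizes)
```

After this step the covariate has a mean of 0 and a standard deviation of 1, so its regression estimate is expressed per standard deviation of log neighborhood size.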


Cross Classification of Classes for 
Word and Nonword Reading 


We combined the classes that were identified in the separate 
word and nonword models. Interestingly, of the four possible 
classes, only three classes of readers emerged. The first class 
consisted of 36 children, who read both words and nonwords 
serially. These children relied on decoding for both types of letter 
strings. A second class of 244 children read both words and 
nonwords in parallel. Finally, 33 children read words in parallel 
but relied on serial processing for nonwords. Only one child was 
identified as a serial processor of words but parallel processor of 
nonwords, indicating that this fourth class of readers did not exist 
in the data. 
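

The cross classification amounts to a two-by-two table of word class by nonword class. The sketch below rebuilds that table from the four counts reported above. It is illustrative only: the per-child list is reconstructed from the counts, and the variable and function names are hypothetical.

```python
from collections import Counter

# Keys are (word class, nonword class); counts are those reported in the text.
reported_counts = {
    ("serial", "serial"): 36,
    ("parallel", "parallel"): 244,
    ("parallel", "serial"): 33,
    ("serial", "parallel"): 1,
}
# Per-child labels rebuilt from the counts; the ordering is arbitrary.
children = [pair for pair, n in reported_counts.items() for _ in range(n)]

def cross_classify(assignments):
    """Count children per (word class, nonword class) combination."""
    return Counter(assignments)

table = cross_classify(children)
total = sum(table.values())
# Marginal counts of serial processors for each type of letter string.
word_serial = sum(n for (word, _), n in table.items() if word == "serial")
nonword_serial = sum(n for (_, nonword), n in table.items() if nonword == "serial")
```

The marginals implied by these counts show more serial processors for nonwords than for words, which matches the asymmetry described in the text.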


Discussion 


In the current study we used serial and discrete digit naming to 
examine serial and parallel processes in word and nonword read- 
ing. In line with the results of de Jong (2011), we found that the 
pattern in the correlations of discrete word reading with serial and 
discrete digit naming changes over time. From second to fifth 
grade the relation of discrete word reading with serial digit naming 
decreased, whereas its relation with discrete digit naming in- 
creased. A novel finding was that a similar pattern was found 
between the formats of digit naming and nonword reading. Re- 
gression analyses revealed that from second to fifth grade the 
amount of unique variance explained by discrete naming increased 
in both word and nonword reading. Previous studies have also 
shown that the relations of serial digit naming with word and 
nonword reading are similar, at least in more transparent orthog- 
raphies (Greek: Georgiou, Papadopoulos, Fella, & Parrila, 2012; 
German: Moll, Fussenegger, Willburger, & Landerl, 2009; Dutch: 
van den Boer et al., 2013). The current results indicate that the 
development of the relations with both discrete and serial naming 
over time is similar for words and nonwords. 

Next, as predicted, we identified two classes of readers for word 
reading based on the correlations with serial and discrete digit 
naming. In line with de Jong (2011), for a large class of readers, 
single word reading was strongly related to discrete digit naming. 
For these readers, the process of reading a single word mirrored 
naming of single overlearned symbols. Words, like digits, were 
read through parallel retrieval of phonological codes. For a second 
class of readers, however, single word reading was more strongly 
related to serial digit naming. For these readers, the process of 
reading a single word more closely resembled the serial naming of 
multiple overlearned symbols, suggesting that word reading in this 
class relies on a serial process. As argued by de Jong (2011), these
results for word reading are compatible with a word-specific view 
of reading development, as assumed for example in the DRC 
model (e.g., Coltheart et al., 2001), but cannot be explained within 
a PDP model (e.g., Plaut et al., 1996). 

A novel and unexpected finding was that the same classes were 
found for nonword reading. Strong correlations between nonword 
reading and serial digit naming were found for one class of readers, 
suggesting that nonwords were identified through serial reading 
processes. For a second and larger class of readers, however, 
nonword reading related most strongly to discrete digit naming, 
which suggests that the nonwords were processed in parallel. At 
first sight, these findings seem to be at odds with both the DRC 
model (e.g., Coltheart et al., 2001) and the PDP model (e.g., Plaut 
et al., 1996). For nonwords, both models predict one specific, 
although different, reading strategy. Whereas nonwords should be 
processed serially according to the DRC model, the PDP model 
predicts parallel processing of all letter strings, words and non- 
words alike. 

Moreover, a clear developmental pattern could be seen, al- 
though we were unable to study individuals’ stability of class 
assignment or transition across classes in the current cross- 
sectional study. First, although in latent class analysis, as opposed 
to group-wise comparisons, no assumptions have to be made about 
equal development of all children within an age group (see Bouw- 
meester & Verkoeijen, 2010), grade level was found to be a good 
proxy of the class assignments. The classes of serial processors of 
both words and nonwords consisted mainly of younger children 
from Grade 2. With just a few exceptions, all the older children in 
Grades 3 and 5 were able to process the words and nonwords in 
parallel. These results are in line with de Jong (2011), who iden- 
tified serial decoders among first and second graders, but not 
fourth graders. On the other hand, like Ehri and Wilce (1983), we 
also found parallel processing in young readers with limited read- 
ing experience. Even the poorer readers eventually read all words 
in parallel, since hardly any serial processors were identified past 
Grade 2. 

Second, when class assignments for word and nonword reading 
were combined, three classes of readers were identified: readers 
who processed both words and nonwords in parallel, readers who 
processed only words in parallel, and readers who relied on serial 
decoding for both words and nonwords. Importantly, the fourth 
possible group of readers, who process words serially but non- 
words in parallel, was not found. Together, these results suggest a 
developmental path. With increasing reading experience, a shift 
seems to occur from a serial decoding strategy to identify every 
letter string toward parallel processing of only words, and later on, 
even nonwords. 

An alternative interpretation of the classes and patterns of cor- 
relations in the current study could be the increasing differentiation 
of abilities over time. In other words, our discrete reading task 
becomes more strongly related to discrete digit naming because of 
similar format and task demands. However, if this interpretation is 
valid, a drop in the relation between serial and discrete digit 
naming would be expected. Our data do not show a difference in 
the relation between serial and discrete digit naming across classes 
of readers: .47 for serial word readers, and .44 for parallel word 
readers. A similar pattern was found by de Jong (2011), who 
reported correlations of .50 and .45 for serial and parallel proces-
sors, respectively. In comparison, in that same study the relation
between serial and discrete reading dropped from .80 to .32. In 
addition, increasing differentiation of abilities is most likely a 
gradual process. In light of a gradual differentiation process, it 
would not follow that at a certain point in time two classes could 
be distinguished for whom the tasks either are or are not differ- 
entiated. Probably, more than two classes would be found. 

To further support our interpretation of the classes of readers, 
length effects were examined. As predicted, for both words and 
nonwords we found that length effects were much larger in the 
classes denoted as serial processors than in the classes of parallel 
processors. This pattern of results was found both in the entire 
sample and in a separate analysis of second grade children (i.e.,
controlling for age differences). 

The larger sensitivity to word and nonword length in the class of 
serial decoders supports our interpretation of the reading strategy 
used. However, the small length effects in the classes denoted as 
parallel processors are not in accordance with the general idea that 
parallel processing of letter strings would result in the absence of 
a length effect. These findings could imply many different things. 
Of course, the results could indicate that our interpretation of the 
difference in reading strategies across the classes is incorrect. For 
several reasons, however, we think it safe to assume that small 
length effects can be observed in parallel processors. First, similar 
small length effects have been regularly reported in advanced adult 
readers (e.g., Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 
2004; Bates, Burani, d’Amico, & Barca, 2001; Ziegler, Perry, 
Jacobs, & Braun, 2001) and children (Ziegler et al., 2003), all of 
whom are assumed to use parallel processing. Such small length 
effects could reflect the involvement of the nonlexical route. 
Although parallel activation of phonology is the dominant reading 
strategy, letter strings are simultaneously processed through the 
nonlexical route. Possibly, the nonlexical route contributed to the 
identification of at least some of the items. In addition, a small 
percentage of serial decoders could have been erroneously as- 
signed to the class of parallel processors, which could also result 
in small length effects. Alternatively, the findings could add to 
previous indications that length effects cannot be uniformly as- 
cribed to serial decoding of letters into phonological codes (e.g., 
Risko et al., 2011; van den Boer et al., 2012). In several compu- 
tational models, length effects are also not ascribed to serial 
activation of phonology. Within PDP models, for example, length 
effects are assumed to reflect visual and articulatory factors or 
neighborhood size (Seidenberg & Plaut, 1998). Alternatively, in 
more recent connectionist dual process models (CDP+: Perry,
Ziegler, & Zorzi, 2007; CDP++: Perry, Ziegler, & Zorzi, 2010),
graphemes are serially connected to the onset, vowel, or coda 
position in a graphemic buffer. Subsequently, phonology for the 
input in the graphemic buffer is activated in parallel, either through 
the lexical route or through a sublexical parallel network of ortho- 
graphic and phonological units. 

In line with this final point, our results do raise the more general 
question of what is initially processed serially. In line with the 
DRC framework (e.g., Coltheart et al., 2001; Pritchard, 2012), we 
have interpreted serial processing as serial activation of phonology 
through grapheme-phoneme conversion along the nonlexical route. 
However, serial processing could also occur at the preceding level 
of letter identification.* Within the DRC model, as a model of
skilled reading, letter features and identities are always identified 
in parallel. In their work on the causes of letter-by-letter dyslexia, 
however, Fiset and colleagues (e.g., Fiset, Arguin, Bub, Hum- 
phreys, & Riddoch, 2005; Fiset, Arguin, & McCabe, 2006; Fiset, 
Gosselin, Blais, & Arguin, 2006) highlight that serial reading 
processes can also occur at the level of letter encoding. When 
presented with words, readers who suffer from letter-by-letter 
dyslexia experience an abnormally low signal-to-noise ratio. As a 
result, these readers present with an impairment at the letter 
encoding level because visual features of individual letters cannot 
be registered with enough precision to activate the corresponding 
letter identities in parallel. Consequently, readers rely on a com- 
pensatory sequential letter processing strategy, and focus on each 
letter separately to achieve the increase in the resolution of the 
visual system necessary to encode the letter. Possibly, our younger 
readers, similar to readers who suffer from letter-by-letter dys- 
lexia, were unable to encode the letter strings in parallel and 
instead processed letters sequentially, irrespective of how phonol- 
ogy was subsequently activated. With increasing reading experi- 
ence, readers might develop the skills necessary for parallel letter 
identification, as seen in adults. 

This alternative interpretation could account for several of our 
findings, such as the fact that even among beginning readers, 
relatively few children were identified as serial processors. It 
would also be less surprising that similar shifts from serial toward 
parallel processing were seen in both word and nonword reading. 
Letter identification should be similar for both types of letter 
strings. Interestingly, interpreting our results in terms of develop- 
ment in letter processing skills would mean that our findings could 
be in line with the DRC model. Our idea that nonword phonology 
could be activated in parallel would be at odds with the DRC 
model, according to which nonwords are predominately processed 
through the nonlexical route. If, however, our findings on reading 
development in children should be interpreted in terms of devel- 
opment in letter identification processes, they could easily be 
accommodated within the DRC model with the addition of a 
developmental process in the initial stage of letter identification. 

Some of our findings, however, appear difficult to explain 
through increases in parallel letter encoding, such as the develop- 
mental trends indicating parallel processing of words before non- 
words. A specific group of children was identified who appeared 
to process words in parallel, but nonwords serially. If it is letter 
features and identities that are increasingly processed in parallel, 
no differences should be expected in the way words and nonwords 
are processed, given that the initial stage of visual feature and letter 
encoding is the same for all letter strings. Furthermore, the corre- 
lation of word reading with discrete digit naming seems difficult to 
interpret. This relation was significant for both serial and parallel 
processors and appeared to increase when words are processed in 
parallel. Since only a single digit is presented in a discrete naming 
task, the task cannot reflect parallel identification of multiple 
items. It could be argued, however, that it is not the number of 
items that is essential in this relation but rather parallel activation 
of all the features of an item, be that a single digit or multiple 
letters. Nevertheless, although visual feature identification could 
account for some individual differences in discrete digit naming, it 
is unlikely to account for the relatively high correlation with 
reading, given the general agreement that discrete digit naming
reflects the retrieval of phonological codes from memory (Bowers
& Swanson, 1991; Jones, Branigan, & Kelly, 2009; Logan & 
Schatschneider, 2014). Taken together, it seems difficult to deter- 
mine exactly what is initially processed serially. Future studies 
could help to examine whether it is mainly letters, mainly phono- 
logical codes, or both that are increasingly activated in parallel. 

Admittedly, the approach taken in the current study adopts 
assumptions and has limitations that should be mentioned. First, 
we have to acknowledge that in the current study only short, 
regular monosyllabic words were studied. The focus on monosyl- 
labic words fits well with the models of the reading system that 
were studied. Both the DRC (e.g., Coltheart et al., 2001) and PDP 
(e.g., Plaut et al., 1996) models focus on monosyllabic word 
reading. The question remains, however, whether the shift from 
serial toward parallel processing can only be found in short words 
or could also be seen in longer monosyllabic or in polysyllabic 
words. In addition, the nonwords in the current study were con- 
structed by interchanging onsets and rhymes of the words. Possi- 
bly, nonwords were processed like words, because of their high 
resemblance to words. Future studies might include multiple sets 
of nonwords, varying in their similarity to words. 

Another limitation lies in the tasks used in the current study. We 
included only a discrete reading task. Thus, our results cannot be 
generalized to serial reading tasks. We also made specific choices 
in the scoring of the discrete naming and reading tasks. The 
reaction latencies obtained in the naming tasks, which are a mea- 
sure of time, were converted to fluency scores, a measure of speed. 
This transformation was chosen to correct for the skewed distri- 
butions of reading latencies (Ratcliff, 1993). Our results are not 
expected to be different, however, if reaction latencies are used, 
since a high correlation (r > .80) was found between fluency 
scores and reaction latencies for both word and nonword reading. 
Moreover, the fluency scores as obtained from the discrete word 
reading task mainly reflect accuracy and automaticity in sublexical 
and lexical processes, which could also be referred to as reading 
rate. Our definition therefore differs from fluency measures based 
on text reading, in which, for example, prosody or comprehension can
also be taken into account (e.g., Kuhn, Schwanenflugel, & Meis-
inger, 2010; Wolf & Katzir-Cohen, 2001). Furthermore, the dis-
crete reading and digit naming tasks were presented on a computer
screen, but the serial naming task was not. However, we do not 
think this had a major effect on our results. Protopapas, Altani, and 
Georgiou (2013) administered both serial and discrete naming 
tasks on a computer and found similar relations with word reading 
as in the current study. 
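
The conversion from latencies (time) to fluency scores (speed) can be illustrated with a reciprocal transformation. This is a hedged sketch: the exact scoring formula is not restated in this section, so the items-per-second reciprocal used here is an assumption. The latencies are hypothetical values, not data from the study, and the function names are illustrative.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

def to_fluency(latencies_s):
    """Convert latencies in seconds to items per second (assumed reciprocal transform)."""
    return [1.0 / t for t in latencies_s]

# Hypothetical right-skewed naming latencies in seconds (not data from the study).
latencies = [0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5, 2.5, 4.0]
fluency = to_fluency(latencies)
```

For these example values the raw latencies are strongly right-skewed, whereas the reciprocal scores are close to symmetric. Reducing skew in this way is the stated motivation for the transformation.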

Finally, we chose to include naming of digits rather than letters. 
When studying word reading, letter naming might seem the more 
obvious choice. Digits were chosen, however, because digit names 
were expected to be even better known to the children,
especially in second grade. In the Netherlands, the names of letters 
are learned after letter sounds. Digit names are acquired earlier. 
Moreover, in Dutch, digit names are monosyllabic words, similar 
to the items in the reading task. However, results are not expected 
to be different if letters are used. De Jong (2011) presented
correlations of discrete word reading with both digit and letter
naming and showed that, past Grade 1, the relations of letter and digit
naming with word reading were almost identical.

* We thank an anonymous reviewer for bringing this alternative explanation to our attention.

Taken together, the results suggest that readers can be sorted 
into latent classes of serial and parallel processors in reading single 
monosyllabic words and nonwords based on the relations with 
serial and discrete digit naming. The different classes were vali- 
dated by large differences in sensitivity to word and nonword 
length. Together, the different classes identified suggest a devel- 
opmental shift from reading all letter strings serially toward par- 
allel processing of words, and later on nonwords. These findings 
possibly challenge current models of the reading system (e.g., 
Coltheart et al., 2001; Plaut et al., 1996) and highlight the need for 
models of the reading system that can accommodate developmen-
tal changes from initial serial processing toward later parallel
processing of all letter strings. 


References 


Aaron, P. G., Joshi, R. M., Ayotollah, M., Ellsberry, A., Henderson, J., & 
Lindsey, K. (1999). Decoding and sight-word naming: Are they inde- 
pendent components of word recognition skill? Reading and Writing, 11, 
89-127. doi:10.1023/A:1008088618970

Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & 
Yap, M. J. (2004). Visual word recognition of single-syllable words. 
Journal of Experimental Psychology: General, 133, 283-316. doi: 
10.1037/0096-3445.133.2.283 

Bates, E., Burani, C., d’Amico, S., & Barca, L. (2001). Word reading and
picture naming in Italian. Memory & Cognition, 29, 986-999. doi:
10.3758/BF03195761

Bouwmeester, S., & Verkoeijen, P. P. J. L. (2010). Latent variable mod- 
eling of cognitive processes in true and false recognition of words: A 
developmental perspective. Journal of Experimental Psychology: Gen- 
eral, 139, 365-381. doi:10.1037/a0019301 

Bowers, P. G., & Swanson, L. B. (1991). Naming speed deficits in reading 
disability: Multiple measures of a singular process. Journal of Experi- 
mental Child Psychology, 51, 195-219. doi:10.1016/0022-0965(91) 
90032-N 

Brus, B., & Voeten, B. (1995). Eén minuut test vorm A en B. Verantwoord- 
ing en handleiding [One-Minute Test, Forms A and B: Justification and 
manual]. Lisse, the Netherlands: Swets & Zeitlinger. 

Celeux, G., & Soromenho, G. (1996). An entropy criterion for assessing 
the number of clusters in a mixture model. Journal of Classification, 13, 
195-212. doi:10.1007/BF01246098 

Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). 
DRC: A dual route cascaded model of visual word recognition and 
reading aloud. Psychological Review, 108, 204-256. doi:10.1037/0033-
295X.108.1.204

de Jong, P. F. (2011). What discrete and serial rapid automatized naming 
can reveal about reading. Scientific Studies of Reading, 15, 314-337. 
doi: 10.1080/10888438.2010.485624 

Denckla, M. B., & Rudel, R. (1976). Naming of object-drawings by 
dyslexic and other learning disabled children. Brain and Language, 3, 
1-15. doi:10.1016/0093-934X(76)90001-8 

Ehri, L. C. (2005). Learning to read words: Theory, findings, and issues. 
Scientific Studies of Reading, 9, 167-188. doi:10.1207/s1532799 
xssr0902_4 

Ehri, L. C., & Wilce, L. S. (1983). Development of word identification 
speed in skilled and less skilled beginning readers. Journal of Educa- 
tional Psychology, 75, 3-18. doi:10.1037/0022-0663.75.1.3 

Faust, M. E., Balota, D. A., Spieler, D. H., & Ferraro, F. R. (1999). 
Individual differences in information-processing rate and amount: Im- 
plications for group differences in response latency. Psychological Bul- 
letin, 125, 777-799. doi:10.1037/0033-2909.125.6.777 


Fiset, D., Arguin, M., Bub, D., Humphreys, G. W., & Riddoch, M. J. 
(2005). How to make the word-length effect disappear in letter-by-letter 
dyslexia: Implications for an account of the disorder. Psychological 
Science, 16, 535-541. doi:10.1111/j.0956-7976.2005.01571.x

Fiset, D., Arguin, M., & McCabe, E. (2006). The breakdown of parallel 
letter processing in letter-by-letter dyslexia. Cognitive Neuropsychology, 
23, 240-260. doi:10.1080/02643290442000437 

Fiset, D., Gosselin, F., Blais, C., & Arguin, M. (2006). Inducing letter-by- 
letter dyslexia in normal readers. Journal of Cognitive Neuroscience, 18, 
1466-1476. doi:10.1162/jocn.2006.18.9.1466 

Forster, K. I., & Chambers, S. M. (1973). Lexical access and naming time. 
Journal of Verbal Learning & Verbal Behavior, 12, 627-635. doi: 
10.1016/S0022-5371(73)80042-8

Frederiksen, J. R., & Kroll, J. F. (1976). Spelling and sound: Approaches
to the internal lexicon. Journal of Experimental Psychology: Human
Perception and Performance, 2, 361-379. doi:10.1037/0096-1523.2.3
.361

Georgiou, G. K., Papadopoulos, T. C., Fella, A., & Parrila, R. (2012). 
Rapid naming speed components and reading development in a consis-
tent orthography. Journal of Experimental Child Psychology, 112, 1-17.
doi:10.1016/j.jecp.2011.11.006 

Jackson, N. E., & Coltheart, M. (2001). Routes to reading success and 
failure: Toward an integrated cognitive psychology of atypical reading. 
New York, NY: Taylor & Francis. 

Jones, M. W., Branigan, H. P., & Kelly, M. L. (2009). Dyslexic and 
nondyslexic reading fluency: Rapid automatized naming and the impor- 
tance of continuous lists. Psychonomic Bulletin & Review, 16, 567-572. 
doi:10.3758/PBR.16.3.567 

Kuhn, M. R., Schwanenflugel, P. J., & Meisinger, E. B. (2010). Aligning 
theory and assessment of reading fluency: Automaticity, prosody, and 
definitions of fluency. Reading Research Quarterly, 45, 230-251. doi: 
10.1598/RRQ.45.2.4 

Logan, J. A. R., & Schatschneider, C. (2014). Component processes in 
reading: Shared and unique variance in serial and isolated naming speed. 
Reading and Writing, 27, 905-922. doi:10.1007/s11145-013-9475-y 

Lubke, G. H., & Muthén, B. O. (2005). Investigating population hetero- 
geneity with factor mixture models. Psychological Methods, 10, 21-39. 
doi: 10.1037/1082-989X.10.1.21 

Marinus, E., & de Jong, P. F. (2010). Variability in the word-reading 
performance of dyslexic readers: Effects of letter length, phoneme length 
and digraph presence. Cortex, 46, 1259-1271. doi:10.1016/j.cortex 
.2010.06.005 

Moll, K., Fussenegger, B., Willburger, E., & Landerl, K. (2009). RAN is 
not a measure of orthographic processing: Evidence from the asymmet- 
ric German orthography. Scientific Studies of Reading, 13, 1-25. doi: 
10.1080/10888430802631684

Muthén, L. K., & Muthén, B. O. (2009). Mplus (Version 5.21) [Computer 
software]. Los Angeles, CA: Muthén & Muthén. 

Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the 
number of classes in latent class analysis and growth mixture modeling: 
A Monte Carlo simulation study. Structural Equation Modeling, 14, 
535-569. doi:10.1080/10705510701575396

Perry, C., Ziegler, J. C., & Zorzi, M. (2007). Nested incremental modeling 
in the development of computational theories: The CDP+ model of
reading aloud. Psychological Review, 114, 273-315. doi:10.1037/0033- 
295X.114.2.273 

Perry, C., Ziegler, J. C., & Zorzi, M. (2010). Beyond single syllables: 
Large-scale modeling of reading aloud with the Connectionist Dual 
Process (CDP++) model. Cognitive Psychology, 61, 106-151. doi:
10.1016/j.cogpsych.2010.04.001 

Plaut, D. C. (1999). A connectionist approach to word reading and acquired 
dyslexia: Extension to sequential processing. Cognitive Science, 23, 
543-568. doi:10.1207/s15516709cog2304_7




Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). 
Understanding normal and impaired word reading: Computational prin- 
ciples in quasi-regular domains. Psychological Review, 103, 56-115. 
doi: 10.1037/0033-295X.103.1.56 

Pritchard, S. C. (2012). Incorporating learning mechanisms into the dual- 
route cascaded (DRC) model of reading aloud and word recognition. 
(Unpublished doctoral dissertation). Macquarie University, Sydney, 
Australia. 

Protopapas, A., Altani, A., & Georgiou, G. K. (2013). Development of 
serial processing in reading and rapid naming. Journal of Experimental 
Child Psychology, 116, 914-929. doi:10.1016/j.jecp.2013.08.004 

Quené, H., & van den Bergh, H. (2004). On multi-level modeling of data 
from repeated measures designs: A tutorial. Speech Communication, 43, 
103-121. doi:10.1016/j.specom.2004.02.004 

Rasbash, J., Steele, F., Browne, W. J., & Goldstein, H. (2008). A user’s 
guide to MLwiN (Version 2.10). Bristol, England: University of Bristol, 
Centre for Multilevel Modelling. 

Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psy-
chological Bulletin, 114, 510-532. doi:10.1037/0033-2909.114.3.510

Risko, E. F., Lanthier, S. N., & Besner, D. (2011). Basic processes in
reading: The effect of interletter spacing. Journal of Experimental Psy-
chology: Learning, Memory, and Cognition, 37, 1449-1457. doi:
10.1037/a0024332

Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-prime: User’s 
guide. Pittsburgh, PA: Psychology Software. 

Schrooten, W., & Vermeer, A. (1994). Woorden in het basisonderwijs: 
15.000 woorden aangeboden aan leerlingen. [Words in primary educa- 
tion: 15,000 words presented to students]. Tilburg, the Netherlands: 
Tilburg University Press. 

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, develop- 
mental model of word recognition and naming. Psychological Review, 
96, 523-568. doi: 10.1037/0033-295X.96.4.523 

Seidenberg, M. S., & Plaut, D. C. (1998). Evaluating word-reading models 
at the item level: Matching the grain of theory and data. Psychological 
Science, 9, 234-237. doi:10.1111/1467-9280.00046 

Share, D. L. (1995). Phonological recoding and self-teaching: Sine qua non 
of reading acquisition. Cognition, 55, 151-218. doi:10.1016/0010- 
0277(94)00645-2 

Share, D. L. (1999). Phonological recoding and orthographic learning: A 
direct test of the self-teaching hypothesis. Journal of Experimental Child 
Psychology, 72, 95-129. doi:10.1006/jecp.1998.2481 

Share, D. L. (2008). On the anglocentricities of current reading research
and practice: The perils of overreliance on an “outlier” orthography.
Psychological Bulletin, 134, 584-615. doi:10.1037/0033-2909.134.4
.584

Snijders, T., & Bosker, R. (1999). Multilevel modeling: An introduction to 
basic and advanced multilevel modeling. London, England: Sage. 

Spinelli, D., de Luca, M., Filippo, G. D., Mancini, M., Martelli, M., & 
Zoccolotti, P. (2005). Length effect in word naming in reading: Role of 
reading experience and reading deficit in Italian readers. Developmental 
Neuropsychology, 27, 217-235. doi:10.1207/s15326942dn2702_2 

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th 
ed.). Boston, MA: Allyn & Bacon. 

van den Boer, M., de Jong, P. F., & Haentjens-van Meeteren, M. M. 
(2012). Lexical decision in children: Sublexical processing or lexical 
search? The Quarterly Journal of Experimental Psychology, 65, 1214-
1228. doi:10.1080/17470218.2011.652136

van den Boer, M., de Jong, P. F., & Haentjens-van Meeteren, M. M. 
(2013). Modeling the length effect: Specifying the relation with visual 
and phonological correlates of reading. Scientific Studies of Reading, 17, 
243-256. doi:10.1080/10888438.2012.683222 

Van den Bos, K. P., Zijlstra, B. J. H., & Van den Broeck, W. (2003). 
Specific relations between alphanumeric-naming speed and reading 
speeds of monosyllabic and multisyllabic words. Applied Psycholinguis-
tics, 24, 407-430. doi:10.1017/S0142716403000213

Weekes, B. S. (1997). Differential effects of number of letters on word and 
nonword naming latency. The Quarterly Journal of Experimental Psy- 
chology A: Human Experimental Psychology, 50, 439-456. doi: 
10.1080/713755710 

Wolf, M., & Katzir-Cohen, T. (2001). Reading fluency and its intervention. 
Scientific Studies of Reading, 5, 211-239. doi:10.1207/S15327
99XSSR0503_2

Ziegler, J. C., Perry, C., Jacobs, A. M., & Braun, M. (2001). Identical 
words are read differently in different languages. Psychological Science, 
12, 379-384. doi:10.1111/1467-9280.00370 

Ziegler, J. C., Perry, C., Ma-Wyatt, A., Ladner, D., & Schulte-Körne, G.
(2003). Developmental dyslexia in different languages: Language spe-
cific or universal? Journal of Experimental Child Psychology, 86, 169-
193. doi:10.1016/S0022-0965(03)00139-5

Zoccolotti, P., de Luca, M., di Pace, E., Gasperini, F., Judica, A., &
Spinelli, D. (2005). Word length effect in early reading and in develop- 
mental dyslexia. Brain and Language, 93, 369-373. doi:10.1016/j.bandl 
.2004.10.010 


Received May 24, 2013 
Revision received March 6, 2014 
Accepted April 20, 2014


Journal of Educational Psychology 
2015, Vol. 107, No. 1, 152-169 


© 2014 American Psychological Association 
0022-0663/15/$12.00 http://dx.doi.org/10.1037/a0036897 


Classmate Characteristics and Student Achievement in 33 Countries: 
Classmates’ Past Achievement, Family Socioeconomic Status, Educational 
Resources, and Attitudes Toward Reading 


Ming Ming Chiu 
Bonnie Wing-Yin Chow
City University of Hong Kong 


Classmates can influence a student’s academic achievement through immediate interactions (e.g., 
academic help, positive attitudes toward reading) or by sharing tangible or intangible family resources 
(books, stories of foreign travel). Multilevel analysis of 141,019 fourth-grade students’ reading achieve- 
ments in 33 countries showed that classmates’ family factors (parent socioeconomic status [SES], home 
educational resources) were more strongly related to a student’s reading achievement than were 
classmates’ characteristics (parent ratings of past literacy skills, attitudes toward reading). However, 
these classmate links to reading achievement differed across students (e.g., high-SES classmates 
benefited high-SES students more than low-SES students). Also, links between classmates’ past reading 
achievement and a student’s current reading achievement were stronger in countries that were richer, 
were more collectivist, or avoided uncertainty less. These findings show how an ecological model of 
family and classmate microsystems, classmate family mesosystem, and country macrosystem can help
provide a comprehensive account of children’s academic achievement.


Keywords: Bronfenbrenner, ecological system theory, classmates, literacy, cross-cultural study


Classmates play a significant role in children’s behaviors and 
academic achievement (Opdenakker & Van Damme, 2001). Spe- 
cifically, a student with higher achieving classmates often shows 
higher academic achievement than does a student with lower 
achieving classmates (Kang, 2007; Zimmer & Toma, 2000). These 
classmates’ influences contribute to a complex environmental sys- 
tem in which various levels of factors interact. However, past 
studies have not explicated the mechanisms by which classmates 
affect a student’s learning. Furthermore, they have not tested 
whether these links are universal or whether they differ across 
countries. Therefore, the present study helps fill these research 
gaps by testing a model of how family, classmate, and country 
characteristics are related to academic achievement among ele- 
mentary school children. The proposed model was supported 
through the reading tests and associated survey data of a represen- 
tative sample of 141,019 fourth graders in 33 countries. 


Environmental Influences on Academic Achievement 


Children develop within a complex environment that consists of 
multiple levels of surrounding contexts, according to Bronfenbrenner’s
(2005) ecological system theory. The immediate contexts
(microsystems; e.g., family, classroom), the relationships
between microsystems (mesosystem; e.g., classmates’ families), and
the broader country resources (macrosystem; e.g., economy, cul- 
tural values) can contribute to student learning (see Figure 1). 
While some relationships are universal, others differ across coun- 
tries. 


Microsystem: Family 


Students in higher socioeconomic status (SES) families (or 
high-SES students) have higher academic achievement, even after 
accounting for genetic factors (e.g., Walker, Petrill, & Plomin, 
2005). Families can use their financial, human, social, and cultural 
capital to give their children learning opportunities (Chiu, 2013). 
Specifically, families with more money (financial capital) can buy 
more educational resources (books, calculators, and so on) to 
create a richer learning environment (Chiu, 2010). Furthermore, 
high-SES students often spend more time with their parents (due to 
fewer competing siblings, less parent time on housework, and 
multitasking parents), so they can capitalize more on their parents’ 
human, social, and cultural capital. Families with more education, 
knowledge, and skills (human capital) often create better learning 
environments for their children, foster better attitudes toward read- 
ing, and teach them more skills than other families can (Davalos, 
Chavez, & Guardiola, 2005; Willms, 1999). 

High-SES families also often have substantial social or cultural 
capital. High-SES families typically have large social networks of 
relatives, friends, and acquaintances with skills or resources (so- 
cial capital) that can help their children learn (Chiu, 2013). Like- 
wise, high-SES families often have cultural possessions or expe- 
riences (cultural capital) that can help their children learn their 
society’s cultural knowledge, skills, and values to adapt to their
school culture (Chiu & Chow, 2010).

Figure 1 note. SES = socioeconomic status; GDP = gross domestic product.

In short, high-SES students have more financial, human, social, 
or cultural capital. Using their greater capital, higher SES students 
can better understand others’ expectations, behave properly at 
school, have closer relationships with teachers and classmates, and 
learn more in school than lower SES students do. 


Microsystem: Classmates 


Children spend a large proportion of time at school and interact 
regularly with their classmates. If a student has a larger social 
network of classmates or stronger relationships with them, he or 
she has more social capital on which to capitalize and learn more 
(Pong, 1997, 1998). Moreover, high-SES students might use their 
superior resources and skills to benefit more from classmates’ 
resources and experiences. 

Classmate benefits. Classmates can help a student learn both 
directly and indirectly (Skibbe, Phillips, Day, Brophy-Herb, & 
Connor, 2012). Classmates can directly help a student by sharing 
information or evaluations. For example, a classmate can explain 
the meaning of a vocabulary word (Murphey, 1994). Also, when a 
student proposes an idea, a classmate can recognize its validity and 
further justify it (Chiu, 2008) or identify its flaws and correct it 
(Chiu & Khoo, 2003). Thus, classmates can provide information, 
validate correct ideas, or recognize flaws to help a student learn. 

Classmates can also help a student learn indirectly through 
motivation and norms. They can motivate a student to enjoy
learning, which helps the student exert effort and persevere when facing
setbacks (Chiu & McBride-Chang, 2006). For example, a class- 
mate can show enthusiasm for a storybook character, which can 
entice a student to study together (Edmunds & Bauserman, 2006; 
Guthrie, Klauda, & Morrison, 2012; Skibbe et al., 2012). When a 
student performs poorly on a reading test, a classmate can provide 
emotional support and encourage further study so that the student 
can do better on the next test (Guthrie et al., 2012; Skibbe et al., 
2012). Hence, classmates can motivate a student via greater enjoyment,
study time, and perseverance.

Figure 1. Model of classmates’ effects on students’ reading achievement (control variables are not shown).

In addition to motivation, classmates can help create and main- 
tain norms of attitude, behavior, and achievement, both among 
friends and within the classroom. Classmates can articulate and 
model positive academic attitudes, behave within discipline norms, 
study hard, and perform well on tests, essays, and other academic 
measures. Together, classmates can cultivate a culture of positive 
attitudes toward reading in which to immerse a student (Johnson & 
Johnson, 1999). Supported by these positive attitudes, classmates 
can model appropriate classroom behavior (e.g., raise hands to 
answer teacher questions, rather than interrupting) and encourage 
a student to behave accordingly (Lewis, 2001; Ma & Willms, 
2004). Buttressed by these attitudes and behaviors, classmates are 
more likely to have higher reading achievement, which raises a 
student’s academic expectations (Baker & Wigfield, 1999). 

Unequal benefits. However, classmates do not benefit each 
student equally. While high-SES students can share some re- 
sources publicly, they may share other resources privately. Pri- 
vately shared resources are an incentive for classmates to try to 
befriend high-SES students rather than low-SES students. As a 
result, high-SES students often attract more classmates and build stronger
relationships with them. Hence, high-SES students might build and 
capitalize on a stronger network of classmates to learn more (Ryan, 
2001). 

As higher SES students may have more capital, higher status, 
and better interaction skills than other students, they can entice 
high-SES classmates into their network and build closer relation- 
ships. As noted earlier, a high-SES student often learns strong 
social skills from family members and can use them to build closer 
relationships with classmates (Chiu & Chow, 2010). As people 
prefer to interact with those who are similar to them (common 
language, similar activities, similar preferences, and so on; ho- 
mophily bias, McPherson, Smith-Lovin, & Cook, 2001; or assor- 
tativeness, Kindermann, 2007), high-SES classmates often share more 
common experiences with high-SES students and prefer to interact with
them, compared with low-SES students (Chiu, in press). Hence, high- 
SES students often have stronger social networks with high-SES 
classmates and capitalize on them to learn more than other students 
(Crosnoe, 2004). 

Moreover, not all interactions with classmates necessarily enhance
a student’s learning. Social capital is positively related to
academic achievement among majority populations (e.g., native- 
born students; Coleman, 1994) but not among minority or margin- 
alized populations (e.g., ethnic minority), which often have less 
social capital (negative or nonsignificant relations; Ream, 2003). 
As students in marginalized groups often devote extensive effort to 
develop tight ties within their lower social capital group, they have 
less time and effort to develop ties with higher social capital 
groups (Ream, 2003). Therefore, students in marginalized groups 
often draw upon less social capital (or even suffer negative effects 
through poor academic attitudes, behaviors, or norms; Ream, 
2003). Together, these studies suggest that the link between social 
capital and academic achievement depends on the student’s own 
resources, attitudes, and skills. 

Also, students with more of a specific resource, attitude, or skill 
might benefit more from that of their classmates. These dimension- 
specific benefits might operate through selective trading, norm 
establishment/maintenance, or skill practice. For example, a stu- 
dent with more books (or other educational resources) might trade 
or lend them to classmates for other books (Mankiw, 2011). In this 
way, richer students with more resources have greater access than 
poorer students to classmates’ resources and can benefit more from 
them. Classmate-established norms (e.g., lunch chats about story 
books) might help a student who enjoys reading to learn more than 
a student who does not enjoy reading (Chiu & McBride-Chang, 
2006). Also, classmates with a skill (e.g., read English) might trade 
more information with a student with some of that skill (knows 
some words) than with another student without any of it (knows no 
words). Or classmates with that skill can help a student with some 
of that skill practice and thereby improve it more than someone 
who lacks the skill entirely. 

In short, classmates are part of a student’s social capital. Class- 
mates can help a student learn directly (sharing information, eval- 
uations) or indirectly (motivation, norms). Moreover, students in 
higher SES families or in the majority population often benefit 
more from classmate resources and experiences than other stu- 
dents. Also, students with more of a resource, skill, or attitude 
might benefit more from that of their classmates via selective 
trading, participation within the norm, or skill practice. 


Mesosystem: Classmates’ Families 


The mesosystem of students and classmates’ families links the
family and classmate microsystems. In addition to interactions 
with classmates, their family capital can help a student learn more 
(Caldas & Bankston, 1997; Chiu & Zeng, 2008; Sallee & Tierney, 
2007). A student can benefit from interactions with classmates’ 
families (e.g., chatting with a classmate’s mom about a mayoral 
election), from their home resources (e.g., borrowing a poetry book),
from their school contributions (e.g., guest talks), or from their 
interactions with school staff (e.g., active parent-teacher organi- 
zation). Thus, the human, financial, cultural, or social capital of 
classmates’ families might influence a student’s learning. 


However, classmates’ families do not necessarily help all stu- 
dents equally. As noted previously, high-SES students tend to have 
stronger social networks of classmates, especially high-SES class- 
mates (Ryan, 2001). Thus, they are more likely to capitalize on 
their classmates’ family capital. Furthermore, marginalized fami- 
lies often have less family capital, so marginalized students with 
close ties to similar classmates might benefit less from their 
classmates’ family members (or even suffer negative effects 
through poor academic attitudes, behaviors, or norms). Thus, high- 
SES students might benefit more from classmates’ family capital. 


Macrosystem: Country 


Classmate effects may differ across countries due to economic 
conditions or cultural values. In addition to their potential effect on 
student achievement, a country’s economy or cultural values might 
moderate classmate or classmate family influence on academic 
achievement. 

Economy. Richer countries (e.g., Switzerland) often have 
more public resources that can raise student achievement or the 
influence of high-achieving classmates. As richer countries often 
provide more public resources (e.g., public libraries, museums) or 
better education (e.g., certified English teachers; Baker, Goesling, 
& LeTendre, 2002), students in richer countries often capitalize on
these opportunities (e.g., reading more library books) to learn 
more. In addition, high-achieving classmates in richer countries 
might use these extra public resources not available in poorer 
countries to help a student learn more (e.g., explaining the word 
newt with a Wikipedia photo; Chiu, 2007). In richer countries, 
students might learn more or might be influenced more by high- 
achieving classmates. 

In addition to the amount of resources, the distribution of 
resources within a country might affect student achievement or 
classmate influences. Greater household income inequality within 
a country might reduce student achievement through diminishing 
marginal returns or homophily bias. Consider a student with two 
dictionaries. She keeps the first one on her desk and uses it often 
when she reads. In contrast, the second one sits on a bookcase 
(unless the first one is lost or is being used). This lower value of 
the second dictionary (or, more generally, additional resources of 
the same type) is diminishing marginal returns (Chiu & Khoo, 
2005). Applied to inequality, a poor student (with one book) likely 
learns more from an extra book than a rich student (with a hundred 
books) would. In countries with greater household income equality 
(such as Norway), poorer students have more resources and might 
benefit more from them, compared with richer students, resulting
in higher achievement overall. Hence, students in countries with
greater equality might show greater achievement than those in 
countries with less equality (e.g., Colombia). In a similar vein, 
viewing students as resources, clustering high-achieving students 
together in a few classes (tracking) or schools (banding) away
from lower achieving students can reduce the latter’s access to 
high-achieving students and reduce overall achievement (Chiu & 
Khoo, 2005). 
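
The diminishing-marginal-returns argument above can be illustrated numerically. The following is a minimal sketch that assumes, purely for illustration, that learning grows with the logarithm of the number of books a student owns; the `learning` function and its log form are hypothetical and are not taken from the study.

```python
import math


def learning(books: int) -> float:
    """Hypothetical concave learning function (assumption): log(1 + books)."""
    return math.log1p(books)


def marginal_gain(books: int) -> float:
    """Extra learning from one additional book, given the current count."""
    return learning(books + 1) - learning(books)


# A poor student (one book) versus a rich student (a hundred books).
poor_gain = marginal_gain(1)
rich_gain = marginal_gain(100)

# Total learning when the same 102 books are split unequally versus equally.
unequal_total = learning(1) + learning(101)
equal_total = learning(51) + learning(51)
```

Under any concave learning function of this kind, the extra book is worth more to the poor student, and the equal split yields the higher total, which is the pattern the equality hypothesis relies on.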

Greater equality might also increase overall student achieve- 
ment through people’s homophily bias toward others of similar 
SES (Kindermann, 2007; McPherson et al., 2001). As a result, 
students in more equal countries might interact, cooperate, and 
share resources more often, resulting in higher achievement overall.
Moreover, increased interactions might increase classmate
effects in more equal countries. Meanwhile, clustering high- 
achieving students together might increase interactions among 
them due to homophily bias, which can mitigate the negative 
effects of unequal access and diminishing marginal returns. 

Cultural values. Apart from economies, cultural values dif- 
fer across countries (LeTendre, Hofer, & Shimizu, 2003). 
Countries differ according to their degree of status hierarchy 
and obedience to authority versus equality (hierarchical vs. 
egalitarian; or power distance), favoring group interests versus 
individual interests (collectivism vs. individualism), emphasiz- 
ing rigidly defined versus flexible gender roles (masculine vs. 
feminine), and tolerance of risk (uncertainty avoidance vs. 
uncertainty tolerance; Hofstede, 2003). Past studies have shown 
that these cultural values have no direct relationship with aca- 
demic achievement, but they moderate the links between other 
factors (e.g., self-concept) and academic achievement (Chiu & 
Klassen, 2009). Consider four hypotheses regarding the mod- 
erating effects of cultural influences. 

First, collectivistic societies value group interests over individ- 
ual interests (House, Hanges, Javidan, Dorfman, & Gupta, 2004). 
In collectivist societies (e.g., Argentina), classmates pay greater 
attention to one another’s preferences (e.g., favorite book), talk 
with one another more often, and conform more closely to group 
norms (e.g., listening, turn-taking, politeness) than in individual- 
istic societies (e.g., Russia; Chiu & Chen, in press-a). Thus, 
classmates are more likely to influence a student’s learning in 
collectivist countries than in individualistic countries (Chiu & 
Chen, in press-b). For example, classmates’ metacognitive strate- 
gies (e.g., set goals, evaluate progress) are more strongly linked to 
a student’s reading achievement in collectivist countries than in 
individualistic countries (Chiu, Chow, & McBride-Chang, 2007). 
Hence, classmate characteristics might have stronger links to a 
student’s academic attitudes, behaviors, or achievements in col- 
lectivist countries than in individualistic countries. 

Second, people in egalitarian countries (e.g., Italy) value equal 
status and equal opportunity more than those in hierarchical coun- 
tries do (e.g., New Zealand; Hofstede, 2003). In egalitarian coun- 
tries, high- and low-status students often attend the same schools, 
and teachers and classmates tend to treat them similarly, so their 
learning experiences tend to be more similar than those in hierar- 
chical countries (Chiu & Zeng, 2008). As students who perceive 
one another as more similar (e.g., more equal) are more likely to 
interact with one another due to homophily bias, students in 
egalitarian countries are more likely than those in hierarchical 
countries to interact with one another (McPherson et al., 2001). 
Moreover, high-achieving students are more likely to interact with 
high-achieving students than low-achieving students, so high- 
achieving classmates might influence a high-achieving student 
more than a low-achieving student. As noted earlier, high-SES 
students are more likely to interact with and influence high-SES 
students than low-SES students. Hence, the effects of classmate 
characteristics on a student’s attitudes, behaviors, or achievements 
might be stronger both in more egalitarian countries and on stu- 
dents with similar characteristics. 

Third, students in more feminine or gender egalitarian coun- 
tries (e.g., Canada) view gender roles as flexible, whereas those 
in masculine countries have rigidly defined gender roles (e.g., 
Kuwait; House et al., 2004). With different gender role expectations,
boys and girls might attend different schools, receive
different treatment from their parents or teachers, and experi- 
ence schooling differently (Chiu & Chow, 2010). As a result, 
girls might interact with girls more often, and boys might 
interact with boys more often. With different expectations and 
experiences, boys and girls might prefer to interact with class- 
mates of the same gender rather than different genders due to 
homophily bias, resulting in weaker social relationships and 
less influence between boys and girls (McPherson et al., 2001). 
Then, classmates of the same gender might have greater influ- 
ence than those of the opposite gender on a student’s attitudes, 
behaviors, or achievements. 

Last, students in cultures with greater uncertainty avoidance 
(e.g., Iran) have lower tolerance of risk. As they prefer familiar 
people, surroundings, and values, they typically prefer interacting 
with family and relatives that they have known for most of their 
lives rather than with relatively new classmates (Hofstede, 2003). 
As a result, students in these cultures are less likely than students 
in uncertainty-tolerant cultures (e.g., Netherlands) to spend much 
time engaging with their classmates. Furthermore, students with 
high uncertainty avoidance are more likely to value their family’s 
attitudes toward schools (e.g., utility of schooling to future career), 
follow their siblings’ behaviors (e.g., hours of study), and model 
their academic expectations on those of their siblings (e.g., which 
grades are acceptable), rather than those of their classmates. As a 
result, classmates might have less influence on students in cultures 
with greater uncertainty avoidance than on those in cultures with 
less uncertainty avoidance. Hence, the effects of classmate char- 
acteristics on a student’s attitudes, behaviors, or achievements 
might be weaker in countries with greater uncertainty avoidance 
than in other countries. 


Present Study 


The present study tests a model of classmate characteristics 
and academic achievement among elementary school children 
in 33 countries (see Table 1). We asked two research questions. 
First, are classmates’ characteristics (including family SES, 
home literacy resources, attitude toward reading, and past read- 
ing achievement) related to student reading achievement? We 
hypothesized that when classmates have higher family SES, 
better home literacy resources, better attitudes toward reading, 
or higher reading performance, students on average have higher 
reading achievement. 

Second, do these links among classmates’ factors, classmate 
family factors, and students’ reading achievement differ across 
countries with different economic characteristics or cultural val- 
ues? We expected links between classmates’ characteristics and a 
student’s reading achievement to be stronger in countries that are 
richer, more equal, or more collectivist, or that cluster students together
by past achievement (tracking or banding). We expected the influence of
same-gender classmate characteristics to be stronger in masculine countries,
whose gender roles are more rigid. Last, we expected links
between classmates’ characteristics and a student’s reading
achievement to be weaker in countries with greater uncertainty avoidance. To
account for the possibility that student achievement is related to 
classmate achievement simply because students of similar past 
achievement attend class together, we included student past 
achievement (specifically, parent rating of past literacy skills) as a
control variable.

Table 1
Theoretical Hypotheses

Explanatory variable                 Hypothesized outcome              Supported?
Greater classmate characteristics    Higher
  Family socioeconomic status        Student reading achievement       Yes
  Educational resources at home      Student reading achievement       Yes
  Parent attitude toward reading     Student reading achievement       Yes
  Attitude toward reading            Student reading achievement       Yes
  Past reading achievement           Student reading achievement       Yes
Greater country characteristics      Stronger
  Wealth                             Classmate influence               Yes
  Economic equality                  Classmate influence
  Clustering students by ability     Classmate influence
  Collectivism                       Classmate influence               Yes
  Egalitarian                        Classmate influence
  Gender egalitarianism              Same gender classmate influence
  Uncertainty tolerance              Classmate influence               Yes


Method 


Data 


The International Association for the Evaluation of Educational
Achievement’s Progress in International Reading Literacy
Study (IEA-PIRLS) assessed 141,019 fourth-grade students’
reading achievement and asked students and principals to com- 
plete questionnaires related to their perceptions of themselves 
and their immediate environments (Martin, Mullis, & Kennedy, 
2003). Students completed an 80-min assessment booklet and 
then a 15- to 30-min questionnaire. We also used economic data 
(World Bank, 2002) and cultural values data (House et al., 
2004). 

This sample included a variety of countries, ranging from 
poor, unequal, collectivist nations (e.g., Colombia) to rich, 
relatively equal, individualistic ones (e.g., Norway). The re- 
gions and countries that participated were Argentina, Belize, 
Bulgaria, Canada (across Alberta, British Columbia, Nova Sco- 
tia, Ontario, and Quebec), Colombia, Cyprus, Czech Republic, 
France, Germany, Greece, Hong Kong, Hungary, Iceland, Iran, 
Israel, Italy, Kuwait, Latvia, Lithuania, Moldova, Morocco, the 
Netherlands, New Zealand, Norway, Romania, Russian Feder- 
ation, Singapore, Slovak Republic, Slovenia, Sweden, Turkey, 
Macedonia, England, Scotland, and the United States. As all 
responses to home questionnaires were missing in the Morocco 
and U.S. data, these two countries were not included in our 
analysis. 


Methodological Design 


Investigating the relationships between classmate characteristics 
and reading achievement across countries requires representative 
sampling, precise tests and questionnaires, and suitable statistical 
models. In each country, IEA chose about 150 representative 
schools based on neighborhood SES and student intake and sam- 
pled one or two fourth-grade classes from each school (stratified 
sampling), resulting in a sample size of about 4,000 students per
country or region (Martin et al., 2003). Students who had intellectual
disabilities, refused to take the exam, could not physically
take it, or did not understand the test language altogether ac- 
counted for less than 4% of the original sample. With suitable 
weights, IEA created representative samples of each country’s 
schools and fourth-grade students. 

Students received subtests (overlapping subsets of all multiple 
choice and open-ended questions) for wider coverage of reading 
skills while reducing student fatigue and learning during the test (a 
balanced incomplete block test; Baker & Kim, 2004). A graded- 
response Rasch model of these subtests measured the difficulty of 
each test item to estimate each student’s reading competence more 
precisely (Baker & Kim, 2004). 

To reduce measurement error, researchers used several ques- 
tionnaire items for each theoretical construct (e.g., SES) to 
create an index via a graded-response Rasch model (Warm,
1989). The multigroup graded-response Rasch models for each 
item in each country yielded similar parameters, indicating 
measurement equivalence across countries (Martin et al., 2003; 
May, 2006). (Compared with factor analysis, a multigroup graded-response
Rasch model has two advantages: It requires only one
invariant anchor item across countries, and it models heterogeneous
use of the ordinal rating scale; Rossi, Gilula, & Allenby,
2001.) Still, the graded-response Rasch model assumes that the
relationships between the items and the construct are the same
across countries. Other studies also have shown consistent 
questionnaire responses and participant understandings across 
countries (Brown, Micklewright, Schnepf, & Waldmann, 2007; 
Martin et al., 2003; Schulz, 2003). To estimate reliability, the 
graded-response Rasch models included computations of the 
information function (Baker & Kim, 2004); when the informa- 
tion function is greater, there is more information, smaller 
standard errors, more precision, and greater reliability (see 
Table 2 for the reliability of each index). PIRLS standardized 
the test scores to a mean of zero (and standard deviation of 1) 
for all data from all countries to facilitate identification of 
scores above and below the overall mean. 
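
The link between the information function and precision, and the standardization of scores, can be sketched as follows. This is an illustrative sketch only: it uses the standard item response theory relation that the standard error of an ability estimate is the reciprocal of the square root of the test information, and it assumes a population standard deviation for the standardization, which the text does not specify. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def standard_error(information: float) -> float:
    """Standard error of an ability estimate: 1 / sqrt(test information)."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used.
    """
    center = mean(scores)
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("scores have zero variance")
    return [(score - center) / spread for score in scores]
```

For example, quadrupling the information halves the standard error, and after standardization a positive score lies above the overall mean and a negative score below it.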

Missing questionnaire response data (8%) can reduce estimation 
efficiency, complicate data analyses, and bias results. Markov 
chain Monte Carlo multiple imputation addresses these missing 
data issues more effectively than deletion, mean substitution, or
simple imputation (Peugh & Enders, 2004). 


Parent rating of past literacy skills. This graded-response 
Rasch-based index (using the Warm, 1989, procedure) was created 
from a parent’s or guardian’s responses to multiple questions in 
order to reduce measurement error. The items used to create this 
index were “My child . . .” (a) recognizes most of the alphabet
letters, (b) reads words, (c) reads sentences, (d) writes letters of the 
alphabet, and (e) writes words. The response choices were not at 
all, not very well, moderately well, and very well. Its reliability was 
0.95. All reliabilities were measured with Cronbach’s alpha. 
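
As a rough illustration of how Cronbach’s alpha is computed from item responses, consider the sketch below. It assumes the four response choices are coded 1 to 4 and that alpha is computed on raw item scores; both are assumptions made for illustration, and the function name `cronbach_alpha` is hypothetical rather than part of the PIRLS procedure.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; rows are respondents, columns are items."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*responses))  # one tuple of scores per item
    item_variance_sum = sum(variance(column) for column in items)
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)
```

When every item orders respondents identically, the total-score variance is large relative to the summed item variances and alpha approaches 1; unrelated items push it toward 0.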

Country. Country-level variables included economic condi- 
tions, cultural values, and distribution of students across schools. 
Economic growth was measured through gross domestic product 
per capita (GDP per capita; World Bank, 2002). Family income
inequality was measured through GDP Gini index (the integral of 
the cumulative distribution function of a perfectly equal income 
society minus the integral of the cumulative distribution function 
of the actual society’s income; World Bank, 2002). Scores can 
range from 0 (perfect equality; everyone has equal income) to 100 
(perfect inequality; one person has all the income, and everyone 
else’s income is zero). The Gini index is suitable for nonnormal
distributions like household income (McKenzie, 2005). 
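
The Gini index described above can be computed from a list of incomes with the standard rank-weighted formula for the discrete Gini coefficient, which is usually stated in terms of the Lorenz curve. The sketch below is illustrative, not the World Bank’s procedure; the function name `gini_index` is hypothetical, and the result is scaled to 0-100 to match the text.

```python
def gini_index(incomes):
    """Gini index on a 0-100 scale from a list of non-negative incomes."""
    values = sorted(incomes)
    count = len(values)
    total = sum(values)
    if count == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Standard rank-weighted formula for the discrete Gini coefficient.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    gini = (2.0 * weighted) / (count * total) - (count + 1.0) / count
    return 100.0 * gini
```

With equal incomes the function returns 0. With one person holding all income in a sample of n people it returns 100(n - 1)/n, which approaches 100 only as n grows large.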

Similar to the Organization for Economic Cooperation and
Development’s (OECD) Program for International Student Assessment
(OECD, 2010), a consortium of 150 researchers collected
over 17,300 responses to a survey of cultural values from middle
managers in finance, telecommunications, and food processing in
61 cultures (House et al., 2004). Cultural values differ mostly
across countries, not within countries; indeed, cultural values are
linked more strongly to one’s nation than to religion, employer
organization, or individual personality (Hofstede, Neuijen, Ohayv,
& Sanders, 1990; Inglehart & Baker, 2000), so managers’ cultural
values serve as proxies for those of the entire nation. House et al.
(2004) created all indices of cultural values from polychoric
correlation-based factor analyses of managers’ responses to questions
on a 7-point Likert scale.

Table 2 (Continued)
Gender (entered at the student level): girl
Psychology variables (entered at the student level): student reading self-concept*, student reading attitude*
Class mean variables (entered at the class level): class mean parent rating of past literacy skills*, class mean SES*, class mean parents’ attitude toward reading*, class mean home education resources*
Note. * Indices were standardized to M = 0 and SD = 1. (Table values are not reproduced here.)

Cultural values included power distance, in-group collectivism, 
gender egalitarianism, and uncertainty avoidance. Hierarchy, or 
power distance, is the degree to which people expect and agree that 
power should be shared unequally, based on five questions with a 
reliability of 0.88 (see questionnaire items for cultural values in
Appendix A, House et al., 2004). In-group collectivism is the 
degree to which people value collective action and collective 
distribution of resources, based on four questions with a reliability 
of 0.95. Meanwhile, gender egalitarianism is the degree to which 
people minimize gender inequality, based on five questions with a 
reliability of 0.95. Uncertainty avoidance is the degree to which 
people rely on social norms, rules, and procedures to reduce the 
unpredictability of future events, based on five questions with a 
reliability of 0.96. 

The distribution of students across schools within a country can 
differ. In some countries, high-achieving students attend one set of 
schools while low-achieving students attend a different set of 
schools, resulting in high clustering of students by past achieve- 
ment. In other countries, students are mixed together in the same 
schools regardless of past achievement. Hence, clustering of stu- 
dents by parent rating of past literacy skills is the ratio of parent 
rating of past literacy skills variance across schools divided by the 
total parent rating of past literacy skills variance within a country. 
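
As a rough illustration of this clustering index, the sketch below computes the between-school share of the total variance for one country. The function name `clustering_ratio`, the toy ratings, and the simple size-weighted variance decomposition are assumptions made for illustration; the text does not state the exact estimator that was used.

```python
from statistics import fmean

def clustering_ratio(ratings_by_school):
    """Between-school variance divided by total variance (one country).

    ratings_by_school maps a school id to the list of its students'
    parent ratings of past literacy skills (hypothetical data layout).
    """
    all_ratings = [r for school in ratings_by_school.values() for r in school]
    grand_mean = fmean(all_ratings)
    n = len(all_ratings)
    # Total variance: mean squared deviation of each student from the grand mean.
    total_var = sum((r - grand_mean) ** 2 for r in all_ratings) / n
    # Between-school variance: squared deviation of each school mean,
    # weighted by the number of students in that school.
    between_var = sum(
        len(school) * (fmean(school) - grand_mean) ** 2
        for school in ratings_by_school.values()
    ) / n
    return between_var / total_var

# Fully sorted schools: all of the variance lies between schools.
tracked = {"A": [1.0, 1.0], "B": [3.0, 3.0]}
# Mixed schools with identical means: none of the variance lies between schools.
mixed = {"A": [1.0, 3.0], "B": [1.0, 3.0]}
print(clustering_ratio(tracked), clustering_ratio(mixed))
```

A ratio near 1 corresponds to strong sorting of students into schools by past achievement, and a ratio near 0 corresponds to schools that mix students regardless of past achievement.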

Family. Family variables included three graded-response 
Rasch-based indices of parent responses to questionnaire items: 
SES, home educational resources, and parent attitude toward read- 
ing. SES was created from father’s education, mother’s education, 
father’s occupation, mother’s occupation, and responses to a ques- 
tion on family financial situation (“How well off do you think your 
family is financially?” with the response choices of not at all well 
off, not very well off, average, somewhat well off, and very well 
off). Occupation responses were recoded according to job status 
(Ganzeboom, 1992). The reliability of the SES index was 0.94. 

Home educational resources was created from the availability 
of the following educational resources at home: computer, study 
desk or table for the student’s own use, books of his or her own, 
access to a daily newspaper (choices: yes or no); number of books 
at home (choices: 0-10, 11-25, 26-100, 101-200, and > 200); and 
number of children’s books at home (choices: 0-10, 11-25, 26-50, 
51-100, and > 100). Its reliability was 0.79. 

Parent attitude toward reading was created from responses to 
the following questions: “I read only if I have to”; “I like talking 
about books with other people”; “I like to spend my spare time 
reading”; “I read only if I need information”; and “Reading is an 
important activity in my home.” The possible response choices 
were disagree a lot, disagree a little, agree a little, and agree a lot. 
Its reliability was 0.82. 


Availability of school resources was an inverted graded- 
response Rasch-based index of whether the principal perceived a 
shortage of the following: qualified teaching staff; teachers with a 
specialization in reading; instructional materials, supplies; school 
buildings and grounds; heating/cooling and lighting systems; in- 
structional space; computers for instructional purposes; computer 
software for instructional purposes; library books; and audio-visual 
resources. Response choices were not at all, a little, some, 
and a lot. Its reliability was 0.94. 

Classmates. To test whether characteristics of classmates are 
linked to a student’s reading achievement, we included both the 
characteristic of a student (e.g., SES) and the mean of this 
characteristic for all students in the same class (class mean SES) 
in a regression. With the student’s own SES controlled, the regression 
coefficient of class mean SES indicates the relationship between 
classmates’ SES and a student’s reading achievement. Similarly, we 
computed class mean parent rating of past literacy skills, class mean 
home educational resources, and class mean attitude toward reading. 
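
The construction of a class mean predictor can be sketched as follows. The record layout (`class_id`, `ses`) and the helper name `add_class_means` are hypothetical, and the sketch assumes, following the wording above, that the class mean is taken over all students in the class, including the student in question.

```python
from collections import defaultdict
from statistics import fmean

def add_class_means(students, key="ses"):
    """Return copies of the student records with a class-mean column added."""
    by_class = defaultdict(list)
    for s in students:
        by_class[s["class_id"]].append(s[key])
    # One mean per class, computed over every student in that class.
    class_mean = {c: fmean(vals) for c, vals in by_class.items()}
    return [{**s, f"class_mean_{key}": class_mean[s["class_id"]]} for s in students]

# Toy records for illustration only.
students = [
    {"class_id": "c1", "ses": -1.0},
    {"class_id": "c1", "ses": 1.0},
    {"class_id": "c2", "ses": 0.5},
    {"class_id": "c2", "ses": 1.5},
]
rows = add_class_means(students)
for row in rows:
    print(row)
```

Both columns (`ses` and `class_mean_ses`) would then enter the same regression, so that the class mean coefficient is estimated net of each student's own value.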

Psychology and gender. Student variables included student 
reading attitude, student reading self-concept, and girl. Student 
reading attitude was a graded-response Rasch-based index of 
student responses to the following: “I read only if I have to”; “I 
like talking about books with other people”; “I would be happy if 
someone gave me a book as a present”; “I think reading is boring”; 
and “I enjoy reading.” Response choices were disagree a lot, 
disagree a little, agree a little, and agree a lot. Its reliability was 
0.86. Student reading self-concept was a graded-response Rasch-based 
index of student responses to the following: “Reading is 
very easy for me”; “I do not read as well as other students in my 
class”; and “Reading aloud is very hard for me.” Response choices 
were disagree a lot, disagree a little, agree a little, and agree a lot. 
As its reliability was 0.62, results involving this variable require 
cautious interpretation. Last, girl has a value of 1 for girls and a 
value of 0 for boys. See Table 2 for overall summary statistics and 
Appendix Tables B1, B2, and B3 for correlation-variance-covariance 
matrices and country means of (a) numbers of classes, (b) numbers of 
students, (c) percentage of missing data, and (d) key variables. 

Analysis. A multilevel logit analysis of plausible values yields 
more precise standard errors than does ordinary least squares 
(Goldstein, 1995; Rust & Rao, 1996). The simple variance com- 
ponents multilevel model tests if the variance at each level is 
significant. 


$$\mathrm{Reading}_{ijk} = \beta_{000} + e_{ijk} + f_{0jk} + g_{00k} \tag{1}$$


The outcome variable $\mathrm{Reading}_{ijk}$ of student i in school j in 
country k has a grand mean intercept $\beta_{000}$, with student-, school-, 
and country-level residuals ($e_{ijk}$, $f_{0jk}$, and $g_{00k}$, respectively). 
Explanatory variables were entered in sequential sets to estimate the 
variance explained by each set (Kennedy, 2008). The index of a 
student’s early literacy skills at first grade (see Table 2 for details) 
reflects the cognitive component of reading and past reading skills 
and is entered first. Next, we considered ecological variables, 
namely, country, family, classmate, and school characteristics. 
Country variables might affect family variables. As families might 
choose their children’s schools, family variables might affect 
classmate and school variables. All of these variables, broadly 
defined as the “ecology” in which reading occurs, might affect 
students’ psychological variables. Hence, we entered the variables 
as follows: parent rating of past literacy skills, ecological (country, 
family, classmate, school), and psychological. All continuous vari- 
ables were centered on their country mean. 


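$$\begin{aligned}
\mathrm{Reading}_{ijk} = {} & \beta_{000} + e_{ijk} + f_{0jk} + g_{00k} \\
& + \beta_{ljk}\,\mathrm{Prior\_literacy\_skills}_{ijk} \\
& + \beta_{00c}\,\mathrm{Country}_{00k} + \beta_{fjk}\,\mathrm{Family}_{ijk} + \beta_{0mk}\,\mathrm{Classmate}_{0jk} \\
& + \beta_{0sk}\,\mathrm{School}_{0jk} + \beta_{pjk}\,\mathrm{Psychological}_{ijk} + \beta_{gjk}\,\mathrm{Girl}_{ijk}
\end{aligned} \tag{2}$$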


First, we entered each student’s index of parent rating of past 
literacy skills when they began first grade (Prior_literacy_skills; 
see Table 2). Then, we tested whether sets of predictors were 
significant with a nested hypothesis test (chi-square log likelihood; 
Kennedy, 2008). Nonsignificant variables were removed. Then, 
we applied the procedure used for Prior_literacy_skills to the country 
variables: log GDP per capita, Gini index, clustering of students by 
past achievement, power distance, in-group collectivism, gender 
egalitarianism, and uncertainty avoidance (Country). Next, we 
applied this procedure to family variables: socioeconomic status 
(SES), home education resources, and parent attitude toward reading 
(Family). 
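
The nested hypothesis test described above can be sketched as a likelihood ratio test: the drop in deviance (-2 log likelihood) when a set of predictors is added is compared with a chi-square distribution whose degrees of freedom equal the number of added predictors. The function names and the deviance values below are hypothetical and serve only as an illustration; they are not taken from the study.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms, integer df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    # Odd df: erfc term plus a finite sum with half-integer gamma values.
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return tail

def nested_test(deviance_reduced, deviance_full, n_added):
    """Likelihood ratio test for adding n_added predictors to a nested model."""
    statistic = deviance_reduced - deviance_full  # difference in -2 log likelihood
    return statistic, chi2_sf(statistic, n_added)

# Hypothetical deviances for a model without and with three added predictors.
stat, p = nested_test(deviance_reduced=10250.0, deviance_full=10238.0, n_added=3)
print(round(stat, 1), round(p, 4))
```

A set of predictors would be retained when the resulting p value falls below the chosen alpha level.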

To test whether country economic characteristics or cultural 
values moderate these links, we applied a random effects model 
(Goldstein, 1995) to determine if the regression coefficients 
($\beta_{fjk} = \beta_{f00} + f_{fjk} + g_{f0k}$) differed across countries ($g_{f0k} \neq 0$?) or 
correlated with Country variables. 

Then, we applied the procedure for Family to classmates’ 
family variables: class mean SES, class mean home education 
resources, class mean attitude toward reading, and class mean 
parent rating of past literacy skills (Classmate). This part of the 
specification tests whether classmates’ family SES, home literacy 
resources, attitude toward reading, or their past reading achieve- 
ment affect a student’s reading achievement. As noted previously, 
the random effects model tests if country characteristics moderate 
links between classmate characteristics and student reading 
achievement. 

Next, we applied this procedure to the school variable: availability 
of school resources (School). Then, we applied this procedure 
to psychological variables: attitude toward reading and reading 
self-concept (Psychological). We also tested gender (Girl). 

Furthermore, we tested whether the relation between classmate 
characteristics and reading achievement differed across students 
by applying a random effects model (Goldstein, 1995). We tested 
if the regression coefficients ($\beta_{fjk} = \beta_{f00} + f_{fjk} + g_{f0k}$; $\beta_{pjk} = 
\beta_{p00} + f_{pjk} + g_{p0k}$) differed across classes ($f_{fjk} \neq 0$? $f_{pjk} \neq 0$?) or 
correlated with Classmate variables. 

To test for multilevel mediation effects by classmate variables, 
we used the multilevel M test, which corrects for potential nonnormal 
distributions and tests the significance of a confidence 
interval based on a critical z ratio determined across multiple data 
simulations (MacKinnon, Lockwood, & Williams, 2004). This test 
reduces false positives (MacKinnon et al., 2004) and has more 
power than other methods to detect small mediation effects 
(Pituch, Stapleton, & Kang, 2006). 


We report how a 10% increase in each continuous variable 
above its mean is linked to reading achievement (result = b × SD × 
[10%/34%]; 1 SD ≈ 34%). As percentage increase is not linearly 
related to standard deviation, scaling is not warranted. 
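
The conversion above can be sketched in a few lines. Under a normal distribution, about 34% of cases lie between the mean and one standard deviation above it, which is where the 34% in the formula comes from. The coefficient and standard deviation used below are illustrative values, not results from this study, and the function name is hypothetical.

```python
from statistics import NormalDist

def ten_percent_effect(b, sd, step=0.10, share_within_one_sd=0.34):
    """Effect of a 10% increase above the mean: b * SD * (10% / 34%)."""
    return b * sd * (step / share_within_one_sd)

# Share of a normal distribution between the mean and one SD above it.
share = NormalDist().cdf(1.0) - 0.5
print(round(share, 4))

# Illustrative values only: b = 8.5 score points per SD, SD = 1.
print(round(ten_percent_effect(b=8.5, sd=1.0), 2))
```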

We used an alpha level of .05. To minimize false positives, we 
controlled for the false discovery rate with the two-stage linear 
step-up procedure, which outperformed 13 other methods in com- 
puter simulations (Benjamini, Krieger, & Yekutieli, 2006). The 
small sample of countries (N = 33) limits identification of 
nonsignificant country-level results (Konstantopoulos, 2008; see Table 
3 for details). As our random effects model portion of the multilevel 
analysis produces estimates of the effects for each country, 
we tested whether the results differed across the 33 country 
subsamples. We also analyzed residuals for influential outliers. 
The analyses were completed with item response theory (IRT) 
command language (Hanson, 2002), MLn (Rasbash & Woodhouse, 
1995), and LISREL (Jöreskog & Sörbom, 2004). 
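
The two-stage linear step-up procedure mentioned above can be sketched as follows, based on the description in Benjamini, Krieger, and Yekutieli (2006): a first Benjamini-Hochberg pass at level q/(1 + q) estimates the number of true null hypotheses, and a second pass uses that estimate to relax the threshold. The function names and the p values are hypothetical; this is a minimal illustration rather than the code used in the study.

```python
def bh_reject_count(pvalues, q):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    count = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= rank * q / m:
            count = rank  # keep the largest rank that meets its bound
    return count

def two_stage_step_up(pvalues, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(pvalues)
    q1 = q / (1 + q)
    # Stage 1: step-up at the reduced level q / (1 + q).
    r1 = bh_reject_count(pvalues, q1)
    if r1 == 0 or r1 == m:
        n_reject = r1
    else:
        # Stage 2: estimate the number of true nulls and rerun the step-up.
        m0_hat = m - r1
        n_reject = bh_reject_count(pvalues, q1 * m / m0_hat)
    cutoff = sorted(pvalues)[n_reject - 1] if n_reject else -1.0
    return [p <= cutoff for p in pvalues]

# Hypothetical p values for six tests.
pvals = [0.001, 0.010, 0.020, 0.030, 0.045, 0.300]
print(two_stage_step_up(pvals))
print(bh_reject_count(pvals, 0.05))
```

With these made-up p values, the two-stage procedure rejects five hypotheses, whereas a single Benjamini-Hochberg pass at .05 rejects four, which illustrates the gain in power from estimating the number of true nulls.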


Results 


Explanatory Model 


Parent rating of past literacy skills, country, family, classmate, 
school, and psychological variables accounted for differences in 
students’ reading scores (see Table 4). Reading scores differed 
across countries (32%), across schools (20%), and across students 
(47% of the variance). All results discussed in this section describe 
first entry into the regression, with all previously included vari- 
ables controlled. Ancillary regressions and statistical tests are 
available upon request. 

Parent rating of past literacy skills. As expected, students 
with higher parent ratings of past literacy skills had higher reading 
achievement scores. Parent rating of past literacy skills accounted 
for 4% of the variance in students’ reading achievement. 

Country. Countries’ economic characteristics and degree of 
clustering of students, but not their cultural variables, were linked to 
a student’s reading achievement. Students in countries with greater 
economic growth (higher GDP per capita) had higher reading 
scores (log GDP per capita accounted for more variance than GDP 
per capita). Economic growth showed the strongest link with 
reading achievement, with a beta of 0.27 in the final model (see 
Table 4, Model 4, row 8). In countries with greater family income 
inequality (higher Gini), students had lower reading scores. Also, 
in countries with greater clustering of students across schools by 
parent rating of past literacy skills (tracking or banding), students 
had lower reading scores. Cultural values were not directly related 
to reading scores. These country variables accounted for 9% of the 
reading achievement differences between countries and 1% of the 
total variance in students’ reading achievement. 


Table 3 
Statistical Power at Each Level of Analysis for Each Effect Size 

                    Effect size 
Level-variable      0.1     0.2     0.3     0.4 
3-Country           0.09    0.20    0.39    0.61 
2-Class             0.65    1.00    1.00    1.00 
1-Student           0.71    1.00    1.00    1.00 

Note. Sample = 141,019 fourth-grade students from 5,279 schools in 33 
countries. 


Table 4 
Summary of Four Multilevel Regression Models Predicting Students’ Reading Scores: Unstandardized Regression Coefficients, 
(Standard Errors), and Standardized Regression Coefficients 

                                                  Regressions predicting reading achievement 
Explanatory variable                              Model 1             Model 2             Model 3             Model 4 
Parent rating of past literacy skills             14.83 (0.15)***     14.82 (0.15)***     [illegible]         8.97 (0.15)*** 
Log GDP per capita                                                    43.26 (4.62)**      [illegible]         [illegible] 
Gini index                                                            −2.78 (0.37)**      [illegible]         [illegible] 
Clustering of students across schools by parent 
  rating of past literacy skills                                      −107.8 (29.76)***   −92.23 (29.20)**    −39.56 (28.69) 
Power distance                                                        0.10 (0.14)         0.11 (0.12)         0.09 (0.11) 
In-group collectivism                                                 −0.12 (0.14)        −0.05 (0.12)        −0.11 (0.11) 
Gender egalitarianism                                                 0.11 (0.08)         0.13 (0.07)         0.06 (0.07) 
Uncertainty avoidance                                                 0.02 (0.11)         0.02 (0.09)         0.01 (0.09) 
Socioeconomic status                                                                      8.54 (0.17)***      7.97 (0.18)*** 
Home education resources                                                                  10.40 (0.20)***     10.01 (0.20)*** 
Parent attitude toward reading                                                            3.42 (0.14)***      [illegible] 
Class mean socioeconomic status                                                           11.52 (0.94)***     10.83 (0.93)*** 
Class mean home education resources                                                       19.67 (1.06)***     20.28 (1.05)*** 
Class mean attitude towards reading                                                       9.19 (0.81)***      8.50 (0.81)*** 
Class mean parent rating of past literacy skills                                          5.37 (0.91)***      6.62 (0.92)*** 
Availability of school resources                                                          1.30 (0.51)*        1.16 (0.49)* 
Students’ attitude towards reading                                                        9.30 (0.14)***      9.47 (0.15) 
Students’ reading self-concept                                                            16.49 (0.14)***     16.35 (0.14)*** 
Girl                                                                                      8.71 (0.28)***      8.65 (0.28)*** 
Log GDP per capita × Class mean parent rating of 
  past literacy skills                                                                                        6.54 (1.91)** 
In-group collectivism × Class mean parent rating of 
  past literacy skills                                                                                        0.37 (0.06)*** 
Uncertainty avoidance × Class mean parent rating of 
  past literacy skills                                                                                        −0.44 (0.05)*** 
SES × Class mean SES                                                                                          1.85 (0.24)*** 
Home education resources × Class mean home education 
  resources                                                                                                   3.57 (0.26) 
Attitude towards reading × Class mean attitude 
  towards reading                                                                                             2.48 (0.38)*** 
Parent rating of past literacy skill × Class mean 
  parent rating of past literacy skills                                                                       5.31 (0.31) 
Variance at each level 
  Country (32%)                                   0.02                0.07                0.42                0.42 
  School (20%)                                    0.06                0.06                0.43                0.46 
  Student (47%)                                   0.04                0.04                0.20                0.20 
Total variance explained                          0.04                0.05                0.31                0.32 

Note. Each regression included a constant term. GDP = gross domestic product; SES = socioeconomic status. 
*p < .05. **p < .01. ***p < .001. 

Family. Students with higher family SES, more educational 
resources at home, or better parent attitudes toward reading scored 
higher in reading, consistent with past research. Family character- 
istics accounted for 15% of the variance in reading scores. 

Classmates. When classmates had higher family SES, more 
home education resources, better parent reading attitudes, better 
reading attitudes, or greater parent rating of past literacy skills, 
a student scored higher in reading. These results support the 
view that classmate family factors contribute to a student’s 
reading achievement. Classmates’ home education resources 
showed the third strongest link to a student’s reading achievement 
(β = .14). Controlling for classmate family factors, classmate 
attitudes and parent rating of past literacy skills were still 
significantly linked to student reading achievement, showing 
that classmate family factors do not fully explain the relation 
between classmates and a student’s reading achievement. No- 
tably, the regression coefficient of class mean attitude toward 
reading was larger than that of class mean parent rating of past 
literacy skills. Classmate characteristics accounted for 6% of 
the variance in reading scores. 


School. Furthermore, students in schools with more resources 
had higher reading scores. School resources accounted for 2% of 
the variance in reading scores. 

Psychology and gender. Students with better reading attitudes 
or higher reading self-concept (second highest β = .17) had 
higher reading scores, accounting for 2% of the variance in reading 
scores. Girls outscored boys in reading on average, accounting for 
1% of the variance in reading scores. 

Differences across countries. The link between reading 
achievement and class mean past achievement varied across countries’ 
economic status and cultural contexts. In countries that were 
wealthier, more collectivist, or lower in uncertainty avoidance, the 
link between classmate parent rating of past literacy skills and a 
student’s reading score was larger. 

Differences across students. The links between classmate 
characteristics (SES, home educational resources, attitude to- 
ward reading, and parent rating of past literacy skills) and a 
student’s reading achievement also varied across students. With 
respect to reading achievement score, higher SES students 
benefit more than lower SES students from higher SES classmates. 
Likewise, when classmates have greater home education 
resources, students with more such resources benefit more than 
students with fewer such resources. In schools where classmates 
have better attitudes toward reading, students who have 
better attitudes toward reading benefit more than those with 
weaker attitudes toward reading. Last, students with stronger 
parent rating of past literacy skills benefit more than students 
with weaker such skills from classmates with greater parent 
rating of past literacy skills. Note that these interactions do not 
nullify the large regression coefficient of class mean SES. 
Despite the differences in benefits, low-SES students with 
high-SES classmates still have substantially higher reading test 
scores than other students. In short, when classmates have more 
of a resource, attitude, or skill, students with more of it benefit 
more than students with less of it. Thus, these results are 
specific to each dimension (students with greater parent rating 
of past literacy skills did not benefit more from classmates with 
better attitudes toward reading). 

These relationships across countries and across students ac- 
counted for 1% of the variance in reading scores. Otherwise, these 
results were consistent across all 33 countries, and there was no 
significant mediation. Examination of the residuals did not show 
influential outliers. 


Discussion 


While students with higher achieving classmates show higher 
academic achievement in many countries with different education 
policies (Kang, 2007; Zimmer & Toma, 2000), researchers have 
not explicated whether classmate characteristics are related to a 
student’s learning nor have they tested whether these links differ 
across countries. The present study extends this line of research by 
showing that classmate characteristics (parent rating of past liter- 
acy skills, attitudes toward reading) and classmates’ parents’ char- 
acteristics and resources (parent SES, material educational re- 
sources at home, parents’ attitudes toward reading) were both 
associated with greater student reading achievement in 33 coun- 
tries, controlling for family characteristics, which were also sig- 
nificantly and substantially related to reading achievement (con- 
sistent with past studies; e.g., Chiu & McBride-Chang, 2006). 
Furthermore, classmates’ parent rating of past literacy skills 
showed stronger links to student reading achievement in countries 
that are richer, more collectivist, or more tolerant of uncertainty. 
These findings underscore the importance of multiple levels of 
environmental contexts and their interactions to provide a more 
comprehensive account of children’s academic achievement. We 
discuss each of these findings. 

Classmates’ attitudes toward reading and parent rating of past 
literacy skills were both linked to a student’s reading achievement. 
The standardized regression coefficient of classmates’ attitudes 
toward reading is larger than that of classmate parent rating of past 
literacy skills, suggesting that classmate attitude has a stronger link 
with a student’s reading achievement. Possibly, classmate achieve- 
ment might be more likely than classmate attitude to trigger a 
negative social comparison, which could reduce a student’s read- 
ing self-concept and, consequently, his or her reading achievement 
(Chiu & Klassen, 2009). Future studies can further examine this 
issue. 

Classmates’ parent and home characteristics were also related to 
a student’s reading achievement. Consistent with past research, 
students whose classmates had higher family SES had higher 


reading performance (Caldas & Bankston, 1997). Controlling for 
classmate parent SES, students whose classmates had more edu- 
cational resources at home had higher reading achievement. Class- 
mate home educational resources had the third largest standardized 
regression coefficient (behind log GDP per capita and reading 
self-concept). Both classmate SES and classmate home educa- 
tional resources had much larger standardized regression coeffi- 
cients than classmate attitude or classmate parent rating of past 
literacy skills, suggesting that classmate family resources have 
stronger links to a student’s learning than do attitude or literacy 
skills. These findings highlight the prominence of classmates’ 
family environments (Bradley & Corwyn, 2002). Such family 
environments might help both their children’s reading achieve- 
ment directly and their classmates’ reading achievements indi- 
rectly; thus, their benefits are twofold. 

However, classmates did not benefit each student equally. When 
classmates had more of a resource, attitude, or skill, students with 
more of it had higher reading achievement than students with less 
of it. These dimension-specific results are consistent with the 
dimension-specific mechanisms of selective trading, norm estab- 
lishment/maintenance, and skill practice. Selective trading of ed- 
ucational resources might have helped students from high-SES 
families or with more home educational resources benefit from 
similarly advantaged classmates (Mankiw, 2011). When class- 
mates have positive attitudes toward reading, students with posi- 
tive attitudes toward reading might engage in normative reading- 
related practices more often and learn more, compared with 
students with negative attitudes toward reading (Chiu & McBride- 
Chang, 2006). Moreover, classmates with greater parent rating of 
past literacy skills might engage and benefit students with stronger 
such skills more than those with weaker such skills. 

This study also extends past international research on class- 
mates’ relations to reading achievement by examining how these 
relations differ across countries. Specifically, the present study 
indicated that the links between classmate parent rating of past 
literacy skills and a student’s current reading achievement were 
stronger in countries that were richer, more collectivist, or more 
tolerant of uncertainty. The stronger classmate link in richer coun- 
tries is consistent with the view that classmates in richer countries 
have more resources that they can use to influence a student’s 
reading achievement (Chiu & Chow, 2010). Meanwhile, the stron- 
ger classmate link in more collectivist countries fits the view that 
classmates in more collectivist cultures are more likely to pay 
attention to one another, interact with one another, help one 
another, and hence influence one another (Wade-Benzoni et al., 
2002). Also, the weaker classmate link in countries with greater 
uncertainty avoidance is consistent with the view that students 
interact with their classmates less often and learn less from them in 
countries with greater uncertainty avoidance. 


Implications 


If future studies replicate these results, they would have several 
implications for research and policy: (a) multidimensional class- 
mate links to student learning, (b) possible classmate interventions, 
(c) potential for harmful economic segregation of students, (d) 
unequal classmate benefits, (e) dimension-specific advantages, and 
(f) country differences. First, the multidimensional characteristics of 
classmates and their families are necessary components of a complete 
theoretical model of classmates and student learning. A student does 
not simply benefit from classmate resources in a general, generic 
way. Instead, each specific classmate resource has a specific 
benefit. Furthermore, the relationships between each classmate 
characteristic and a student’s reading achievement differed. In- 
deed, classmate family characteristics (classmate family SES, 
home educational resources) had stronger links to student reading 
achievement than classmate attitudes toward reading did, which, in 
turn, had a stronger link to it than classmate parent rating of past 
literacy skills did. 

Second, the impact of classmates suggests that interventions for 
low-achieving students might consider including classmates. Be- 
yond serving as sources of information, classmates might help a 
student through motivating a better attitude toward reading, greater 
study time, or further perseverance. Classmates might also help 
create and maintain supportive norms of attitude, behavior, and 
achievement. Future studies can test such interventions. 

Third, the substantial standardized regression coefficients of 
classmate family SES and home educational resources (especially 
compared to classmate attitude or parent rating of past literacy 
skills) are consistent with the danger of harmful economic segre- 
gation (Chiu & Khoo, 2005). If rich students attend separate 
schools away from poor students, poor students can lose access to 
rich students’ family resources and learn less (Chiu, in press). 

Fourth, even if students are economically integrated into the 
same schools, they do not share resources equally (Ryan, 2001). 
High-SES classmates benefit high-SES students more than low- 
SES students. Likewise, classmates with more home educational 
resources benefit students with more home educational resources 
more than those with fewer such resources. These results are 
consistent with (a) high-SES students having more resources or 
skills to attract other high-SES classmates (Chiu & Chow, 2010) or 
(b) dimension-specific advantages through selective trading of 
resources (Mankiw, 2011). 

Fifth, classmate attitudes and skills show specific benefits. 
These results suggest that we match classmates with, say, specific 
skills to students who need those skills rather than simply letting a 
student get help from any classmate who might have the desired 
skills. Furthermore, classmates with better attitudes toward reading 
benefit students with better attitudes toward reading more than 
those with poorer attitudes, possibly through classmate norms that 
encourage related learning behaviors (Chiu & McBride-Chang, 
2006). Meanwhile, classmates with greater parent rating of past 
literacy skills benefit students with greater parent rating of past 
literacy skills more than those with weaker such skills, possibly 
through trading 
information or shared practice (Mankiw, 2011). Future studies can 
test these possible mechanisms. 

Sixth, the standardized regression coefficients of classmate 
characteristics differ across countries with respect to their econo- 
mies and cultural values. Therefore, educational interventions and 
policies that rely on peer interaction and cooperation at school 
might have greater impact in countries that are richer, are more 
collectivist, or avoid uncertainty less. These findings underscore 
the view that a full account of classmate influence must capture 
characteristics of each country’s economy and cultural contexts. 

Still, most of the relations between classmate characteristics and 
reading achievement were consistent and robust across countries. 
(Classmate literacy skills is the notable exception.) Thus, these 
classmate relations with reading achievement remain candidates 


for universality and show the importance of classmates when 
seeking to understand antecedents of reading achievement. Over- 
all, the findings of this study showed academic achievement’s 
links with family, classmate, and country characteristics, under- 
scoring the importance of studying various ecological systems as 
conceptualized in Bronfenbrenner’s (2005) theory. 


Limitations and Future Studies 


This study had several limitations. First, this study focused on 
students’ reading achievement, in which students learn different 
scripts across countries. Future studies can include different as- 
pects of academic performance, such as mathematics and history, 
to provide a more comprehensive picture of classmate influence. 
Second, some measures in these pre-existing data do not perfectly 
capture the constructs of interest. For example, the cultural values 
are similar among citizens within a country (Inglehart & Baker, 
2000), but some students, especially from cultural minorities, 
might have substantially different cultural values. For example, a 
Chinese American student living in the United States might have 
much more collectivist values than most individualistic U.S. stu- 
dents. As this study does not account for these possibilities, the 
statistical power of this study is lower, possibly yielding some 
nonsignificant statistical results that in reality should be signifi- 
cant. Thus, future studies can also collect individual students’ 
cultural values. Also, some of the differences in reading achieve- 
ment between countries might stem from their different orthogra- 
phies. Specifically, the difficulty levels of reading vary across 
orthographies, with the transparent scripts easier to learn than the 
opaque ones. Also, the timing of when children formally start 
learning to read differs across countries. Furthermore, future stud- 
ies can use behavioral measures rather than surveys. Last, this 
correlational study does not warrant causal interpretations, which 
future studies can help address with longitudinal designs. 


Conclusion 


This study explicated the relationship between classmate read- 
ing achievement and student reading achievement, suggesting pos- 
sible mechanisms and showing when these differed across coun- 
tries. Classmates’ family SES and educational resources at home 
were more strongly linked to student reading achievement than 
were classmates’ attitudes toward reading or their parent rating of 
past literacy skills, controlling for country, family, school, and 
student characteristics. However, these classmate links to reading 
achievement differed across students. High-SES classmates bene- 
fited high-SES students more than low-SES students. These 
dimension-specific advantages also applied to classmates’ home 
educational resources, attitudes toward reading, and parent rating 
of past literacy skills. 

Furthermore, the links between classmate past reading achieve- 
ment and a student’s current reading achievement were stronger in 
countries that were richer, were more collectivist, or had less 
uncertainty avoidance. Other classmate characteristics’ links to a 
student’s reading achievement did not differ significantly across 
countries, so they remain candidates for universal relations across all 
countries. These findings underscore the view that a full account of 
a student’s reading achievement must capture the ecological rela- 
tionships in the family microsystem, classmate microsystem, class- 
mate family mesosystem, and the country macrosystem. 
Appendix A 
GLOBE Cultural Value Questions 


All questions below are on a 7-point Likert scale. 


Power Distance 


3-5. I believe that a person’s influence in this society should be 
based primarily on: 

(1) one’s ability and contribution to the society . . . (7) the 
authority of one’s position. 


3-13. I believe that followers should: 

(1) obey their leader without question . . . (7) question their 
leader when in disagreement. 


3-28. I believe that people in positions of power should try to: 

(1) increase their social distance from less powerful individ- 
uals . . . (7) decrease their social distance from less powerful 
people. 

3-33. When in disagreement with adults, young people should 
defer to elders. 

(1) Strongly agree . . . (7) Strongly disagree 

3-35. I believe that power should be: 


(1) concentrated at the top . . . (7) shared throughout the 
organization 


Institutional Collectivism 


3-7. I believe that in general, leaders should encourage group 
loyalty even if individual goals suffer. 

(1) Strongly agree . . . (7) Strongly disagree. 

3-12. I believe that the economic system in this society should 
be designed to maximize: 

(1) individual interests . . . (7) collective interests. 


3-36. In this society, most people prefer to play: 
(1) individual sports . . . (7) team sports. 


3-37. I believe that: 
(1) group cohesion is better than individualism . . . (7) individ- 
ualism is better than group cohesion 


Gender Egalitarianism 


3-17. I believe that boys should be encouraged to attain a higher 
education more than girls. 


(1) Strongly agree . . . (7) Strongly disagree. 

3-22. I believe that there should be more emphasis on athletic 
programs for: 

(1) boys . . . (7) girls. 

3-26. I believe that this society would be more effectively 
managed if there were: 

(1) many more women in positions of authority than there are 
now . . . (7) many less women in positions of authority than there 
are now. 


3-38. I believe that it should be worse for a boy to fail in school 
than for a girl to fail in school. 

(1) Strongly agree . . . (7) Strongly disagree. 

3-39. I believe that opportunities for leadership positions should 
be: 

(1) more available for men than for women . . . (7) more 
available for women than for men 


Uncertainty Avoidance 


3-1. I believe that orderliness and consistency should be 
stressed, even at the expense of experimentation and innovation. 

(1) Strongly agree . . . (7) Strongly disagree. 

3-16. I believe that a person who leads a structured life that has 
few unexpected events: 

(1) has a lot to be thankful for . . . (7) is missing a lot of 
excitement. 


3-19. I believe that societal requirements and instructions should 
be spelled out in detail so citizens know what they are expected to 
do. 

(1) Strongly agree . . . (7) Strongly disagree. 

3-24. I believe that society should have rules or laws to cover: 

(1) almost all situations . . . (7) very few situations. 


3-25. I believe that leaders in this society should: 

(1) provide detailed plans concerning how to achieve goals . . . 
(7) allow the people freedom in determining how best to achieve 
goals. 




Appendix B 
Ancillary Tables and Results 


Table B1 
Correlations of Key Variables 


Variable 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

2 0.20 
3 0.20 0.06 
4 -0.33 [illegible] [illegible] 
5 -0.19 0.06 -0.59 0.51 
6 -0.21 -0.04 -0.49 0.36 0.43 
7 -0.30 0.08 -0.43 0.62 0.54 0.69 
8 0.21 0.04 0.49 -0.36 -0.43 -1.00 -0.69 
9 0.18 0.14 [illegible] [illegible] [illegible] [illegible] [illegible] 0.26 
10 0.36 0.09 [illegible] -0.26 [illegible] [illegible] [illegible] 0.18 0.10 
11 0.45 0.13 [illegible] [illegible] [illegible] [illegible] [illegible] 0.32 0.27 0.52 
12 0.42 0.08 [illegible] -0.47 -0.32 [illegible] [illegible] 0.30 0.16 0.62 0.57 
13 0.49 0.10 [illegible] [illegible] [illegible] [illegible] [illegible] 0.45 0.38 [illegible] 0.71 0.80 
14 0.25 0.08 [illegible] [illegible] -0.06 [illegible] -0.21 0.16 0.02 0.29 0.33 0.47 0.47 
15 0.16 0.48 0.13 0.23 0.14 -0.08 0.17 0.08 [illegible] [illegible] [illegible] [illegible] [illegible] [illegible] 
16 0.26 0.03 [illegible] -0.34 -0.22 -0.32 [illegible] 0.32 0.29 0.20 0.31 0.32 [illegible] 0.14 0.06 
17 0.19 [illegible] -0.07 0.03 0.08 0.06 0.09 -0.06 -0.04 0.05 0.09 0.02 0.00 0.06 0.07 -0.02 
18 0.29 0.15 [illegible] -0.10 -0.07 -0.08 -0.05 0.08 0.01 [illegible] [illegible] [illegible] [illegible] [illegible] [illegible] 0.03 0.25 
19 0.09 0.09 0.00 0.01 0.00 0.01 0.01 -0.01 0.00 -0.01 0.02 0.00 0.00 0.01 0.04 -0.02 0.20 0.07 


Note. Variables: (1) Reading achievement, (2) Parent rating of past literacy skills, (3) Log gross domestic product (GDP) per capita, (4) Gini index, (5) 
Clustering students across schools by past achievement, (6) Power distance, (7) In-group collectivism, (8) Gender egalitarianism, (9) Uncertainty avoidance, 
(10) Socioeconomic status (SES), (11) Home education resources, (12) Class mean SES, (13) Class mean home education resources, (14) Class mean 
attitude toward reading, (15) Class mean past achievement, (16) Availability of school resources, (17) Students’ attitude towards reading, (18) Students’ 
reading self-concept, (19) Girl. 
Table B2 
Number of Classes, Mean Number of Students Per Class, and Proportion of Missing Data in 
Each Country 


Country             Number of classes  Mean students per class  % Missing 
Argentina           138                23.9                     12 
Belize              139                20.9                     10 
Bulgaria            170                20.4                     5 
Canada              409                20.2                     10 
Colombia            196                26.2                     9 
Cyprus              150                20.0                     8 
Czech Republic      141                21.4                     [illegible] 
France              [illegible]        16.6                     10 
Germany             393                19.4                     10 
Greece              145                17.2                     8 
Hong Kong           147                34.4                     12 
Hungary             216                22.0                     7 
Iceland             242                [illegible]              11 
Iran                282                26.3                     6 
Israel              147                27.0                     6 
Italy               184                19.0                     6 
Kuwait              265                26.9                     13 
Latvia              141                21.4                     [illegible] 
Lithuania           146                17.6                     8 
Moldova             150                23.6                     4 
Netherlands         195                [illegible]              7 
New Zealand         173                14.5                     9 
Norway              199                17.4                     5 
Romania             167                [illegible]              5 
Russian Federation  206                19.9                     6 
Singapore           196                [illegible]              4 
Slovak Republic     176                21.6                     3 
Slovenia            155                19.0                     7 
Sweden              344                20.9                     6 
Turkey              154                33.3                     8 
Macedonia           159                23.6                     13 
England             132                23.9                     14 
Scotland            136                20.0                     8 
Overall             6414               22.0                     8 
Table B3 
Country Means and Standard Deviations of Key Variables 


Column groups: (1) Reading achievement, (2) Class mean SES, (3) Class mean home education resources, 
(4) Class mean parent attitude toward reading, (5) Class mean attitude towards reading, 
(6) Class mean parent rating of past literacy skills. 

Country             (1) M       (1) SD      (2) M       (2) SD      (3) M       (3) SD      (4) M       (4) SD      (5) M       (5) SD      (6) M       (6) SD 
Argentina           417.85      96.02       -0.45       0.43        -0.69       0.44        -0.20       0.20        [illegible] 0.24        -0.03       0.31 
Belize              326.26      106.42      -0.59       0.60        -0.75       0.46        [illegible] 0.31        -0.25       0.33        [illegible] 0.41 
Bulgaria            550.37      82.92       -0.16       0.73        [illegible] 0.79        0.10        0.67        0.21        0.37        0.08        0.62 
Canada              544.31      [illegible] 0.54        0.44        0.58        0.38        0.11        0.32        0.00        0.34        0.27        0.33 
Colombia            422.23      80.74       -0.67       0.69        [illegible] 0.46        -0.02       0.34        0.16        0.29        -0.01       0.47 
Cyprus              493.56      82.15       0.03        0.41        [illegible] 0.24        0.07        0.22        0.02        0.29        -0.07       0.22 
Czech Republic      536.81      63.96       0.08        0.31        0.44        0.33        0.07        0.26        -0.28       0.34        -0.61       0.25 
France              525.40      70.48       -0.01       0.46        0.30        0.41        -0.14       0.29        0.04        0.28        0.16        0.25 
Germany             538.89      66.84       0.38        0.29        0.39        0.32        [illegible] 0.35        0.07        0.31        -0.37       0.26 
Greece              524.45      72.68       -0.06       0.65        -0.06       0.42        0.25        0.33        0.18        0.39        0.34        0.28 
Hong Kong           [illegible] 63.38       -0.56       0.50        -0.43       0.41        [illegible] 0.21        -0.22       0.28        0.45        0.16 
Hungary             543.54      65.27       0.16        0.44        0.40        0.50        0.40        0.37        -0.17       0.39        -0.69       0.29 
Iceland             [illegible] 75.43       0.28        0.48        0.78        0.26        0.00        0.29        0.18        0.34        -0.18       0.27 
Iran                414.73      92.27       [illegible] 0.58        [illegible] 0.64        -0.01       0.42        0.06        0.43        -0.14       0.72 
Israel              507.67      94.07       0.11        0.36        0.11        0.33        -0.06       0.22        -0.08       0.31        0.42        0.30 
Italy               541.00      71.01       0.25        0.38        [illegible] 0.37        -0.05       0.32        [illegible] 0.35        [illegible] 0.29 
Kuwait              395.90      89.48       0.01        0.13        -0.34       0.35        -0.26       0.17        -0.04       0.30        -0.33       0.26 
Latvia              544.74      60.72       0.21        0.36        0.27        0.41        -0.08       0.24        -0.22       0.30        0.21        0.33 
Lithuania           543.53      63.99       0.28        0.39        -0.16       0.41        -0.08       0.26        -0.04       0.33        0.09        0.33 
Moldova             491.40      74.74       [illegible] 0.48        -0.85       0.54        -0.24       0.36        0.17        0.36        -0.34       0.50 
Netherlands         554.38      [illegible] 0.03        0.31        0.35        0.30        -0.09       0.23        -0.24       0.35        -0.24       0.23 
New Zealand         528.86      93.46       0.40        0.51        0.58        0.41        0.20        0.36        -0.01       0.39        0.11        0.27 
Norway              499.44      82.11       0.58        0.48        0.80        0.27        0.32        0.34        -0.08       0.32        -0.02       0.25 
Romania             511.80      89.01       -0.38       0.52        -0.68       0.63        -0.28       0.56        0.29        0.32        -0.23       0.45 
Russian Federation  528.58      67.83       0.35        0.43        -0.07       0.54        -0.03       0.38        0.16        0.35        -0.26       0.59 
Singapore           528.29      91.02       -0.04       0.59        0.42        0.48        -0.20       0.21        0.12        0.32        0.79        0.41 
Slovak Republic     518.86      70.02       0.12        0.41        0.07        0.47        0.28        0.32        -0.20       0.34        -0.65       0.27 
Slovenia            501.20      71.91       0.04        0.36        0.06        0.34        0.14        0.26        0.05        0.35        0.13        0.26 
Sweden              561.17      65.49       0.54        0.47        0.88        0.37        0.32        0.34        0.01        0.32        0.18        0.27 
Turkey              449.01      86.64       -0.51       0.50        -1.03       0.59        -0.28       0.45        0.30        0.28        [illegible] 0.47 
Macedonia           441.43      102.24      [illegible] 0.47        -0.37       0.39        0.12        0.35        0.36        0.28        0.34        0.32 
England             553.42      85.27       0.10        0.33        0.55        0.35        0.03        0.26        [illegible] 0.36        0.18        0.18 
Scotland            528.54      84.57       0.14        0.35        0.36        0.38        0.07        0.27        -0.14       0.42        -0.08       0.20 


Note. SES = socioeconomic status. 
To What Extent Do Teacher-Student Interaction Quality and Student 
Gender Contribute to Fifth Graders’ Engagement in Mathematics Learning? 
This study examines concurrent teacher-student interaction quality and 5th graders’ (n = 387) engage- 
ment in mathematics classrooms (n = 63) and considers how teacher-student interaction quality relates 
to engagement differently for boys and girls. Three approaches were used to measure student engagement 
in mathematics: Research assistants observed engaged behavior, teachers reported on students’ engage- 
ment, and students completed questionnaires. Engagement data were collected 3 times per year 
concurrent with measures of teacher—student interaction quality. Results showed small but statistically 
significant associations among the 3 methods. Results of multilevel models showed only 1 significant 
finding linking quality of teacher—student interactions to observed or teacher-reported behavioral en- 
gagement; higher classroom organization related to higher levels of observed behavioral engagement. 
However, the multilevel models produced a rich set of findings for student-reported engagement. 
Students in classrooms with higher emotional support reported higher cognitive, emotional, and social 
engagement. Students in classrooms higher in classroom organization reported more cognitive, emo- 
tional, and social engagement. Interaction effects (Gender × Teacher–student interaction quality) were 
present for student-reported engagement outcomes but not in observed or teacher-reported engagement. 
Boys (but not girls) in classrooms with higher observed classroom organization reported more cognitive 
and emotional engagement. In classrooms with higher instructional support, boys reported higher but 
girls reported lower social engagement. The discussion explores implications of varied approaches to 
measuring engagement, interprets teacher–student interaction quality and gender findings, and considers 
the usefulness of student report in understanding students’ math experiences. 
There have been remarkable shifts in mathematics education in 
the past 2 decades. The stance advanced by the National Council 
of Teachers of Mathematics (NCTM, 2000) and codified in the 
U.S. Common Core State Standards Initiative (CCSSI, 2014) de- 
scribes learning mathematics as a dynamic, exploratory process 
focused on creating opportunities for students to develop a con- 
ceptual understanding of mathematics. Students are expected to 
identify and describe patterns in mathematics, participate in con- 
versations about mathematical problem solving, use mathematics 
to think and reason, and justify their mathematical thinking. The 
new emphasis contrasts with a traditional view describing mathe- 
matics education as a static set of facts, concepts and procedures to 
be learned and memorized (Henningsen & Stein, 1997; Hiebert & 
Grouws, 2007; Schoenfeld, 1992). 

Standards-based mathematics education heightens the demand 
for students to be actively engaged in learning. Imagine the chal- 
lenge from the perspective of a fifth grade teacher striving to 
produce high levels of engagement in her classroom. The teacher 
may have been handed clear guidelines on what to teach; for 
instance, CCSSI guidelines emphasize multiplication and division 
of fractions and fluency in operations with multidigit whole num- 
bers and decimals to the hundredths (CCSSI, 2014). However, how 
to teach in a way that fully engages students is much less clear. 
Fifth graders are experienced students and often know how to 
comply and appear engaged in learning. However, learning math- 
ematics requires much more than simply the appearance of en- 
gagement. Students need to feel engaged in math learning for the 
instruction to take hold. Students need to pay attention, become 
interested in the mathematical ideas, and even work together with 
others on math problems. The heightened demands establish the 
need for research that identifies conditions that foster student 
engagement. 

Despite the large body of research on engagement, few studies 
focus exclusively on math, little work measures teacher behaviors 
and student engagement concurrently, and work using student- 
report data is scarce (Christenson, Reschly, & Wylie, 2012; 
Fredricks, Blumenfeld, & Paris, 2004). Finn and Zimmer (2012) 
have stated a need for research on classroom contexts that support 
and threaten student engagement: “A package of assessments for 
this purpose would involve observations of students in the school 
setting, observations of teacher–student interactions (with specific 
foci), and reactions from students themselves” (p. 125). 

The present study addresses this stated need. Five unique con- 
tributions stand out. First, we gather student engagement data 
based on observational, teacher-report, and student-report mea- 
sures. Second, we measure teacher—student interaction quality and 
engagement concurrently to understand the temporal coupling 
between teachers’ behaviors and students’ experience. Third, we 
gather data in mathematics classrooms only, whereas most re- 
search on teacher—student interaction quality is not content spe- 
cific. Fourth, we consider various facets of teacher—student quality. 
We examine teacher sensitivity and supportiveness, the approach 
to behavior management, and opportunities for higher order think- 
ing and back-and-forth conversation between teachers and stu- 
dents. Fifth, we examine the extent to which classroom conditions 
are equivalently important for girls and boys. Thus, our goal is to 
identify immediate classroom conditions that enhance and dimin- 
ish engagement for fifth grade boys and girls. The work was 
designed to provide basic research insights for mathematics edu- 
cators concerned with leveraging teacher—student interaction qual- 
ity to improve engagement in math learning. 


Theoretical Perspective 


The work was guided by an integrative framework of motivation 
(Skinner, Kindermann, Connell, & Wellborn, 2009) that explains 
how characteristics of children’s contexts contribute to self- 
systems and self-perceptions, which lead to action (engagement or 
disaffection) and ultimately to social, emotional, and academic 
outcomes (Skinner et al., 2009). Skinner et al. (2009) defined 
children’s contexts as settings composed of peers, teachers, family 
members and others with whom children engage in social interac- 
tions and activities. Self-systems and self-perceptions refer to 
children’s beliefs, cognitive appraisals, and perceptions of them- 
selves that develop in children as a result of their past experiences, 
mold children’s interpretation of their experiences, and play an 
important role in motivating children’s behavior. Action refers to 
engagement versus disaffection; each reflects an outward signal of 
motivational state and describes the quality of children’s interac- 
tions with their physical and social world. The outcomes include 
social, cognitive and personality development. 

The integrative framework of motivation describes motivation 
as a dynamic, developing characteristic that is sensitive to contexts 
external to the child (e.g., interactions with teachers). Further, the 
framework introduces the utility of measuring a child’s engage- 
ment at a particular point in time as one way to “capture the target 
definitional manifestations of motivation—namely, energized, di- 
rected, and sustained action” (Skinner et al., 2009, p. 225). This 
view applies to students in elementary math classrooms. Most fifth 
grade students do not come to math class as “engaged” or “disen- 
gaged.” Students’ engagement in math class varies depending on 
their interactions with teachers, peers, and materials (Connell & 
Wellborn, 1991; Skinner & Belmont, 1993). Further, students’ 
engagement varies across days, weeks, and months—a student 
who appears engaged in mathematics instruction one day may be 
less engaged in math class one full month later. 

We apply the integrative framework of motivation to understand 
day-to-day interactions between teachers and students. Each year 
of math instruction is composed of daily experiences that vary in 
quality and accrue to create a cumulative experience for students. 
We disaggregate the year of math instruction by sampling specific 
days and assessing the immediate correspondence between teach- 
ers’ interactions with students and students’ engagement. The 
work is situated at an important point developmentally; fifth grad- 
ers are capable of reflecting upon and reporting their engagement 
in learning, and fifth grade marks a turning point when boys begin 
to outperform girls in mathematics (Robinson & Lubienski, 2011). 


Engagement in Learning 


Engagement has been described as “the glue, or mediator, that 
links important contexts—home, school, peers, and communi- 
ty—to students and, in turn, to outcomes of interest” (Reschly & 
Christenson, 2012, p. 3). Existing research establishes that engage- 
ment is critical for learning and that engagement forecasts school 
success. Students who stay on task, attend to learning goals, and 
participate actively in the learning experience show better aca- 
demic achievement in elementary school (Fredricks et al., 2004; 
Greenwood, Horton, & Utley, 2002; Hughes & Kwok, 2007; Ladd, 
Birch, & Buhs, 1999; Ponitz, Rimm-Kaufman, Grimm, & Curby, 
2009; Reyes, Brackett, Rivers, White, & Salovey, 2012; Tucker et 
al., 2002). 

Definitions of engagement vary considerably; a three-part def- 
inition of engagement that includes behavioral, cognitive, and 
emotional engagement is most prevalent (Reschly & Christenson, 
2012; Fredricks et al., 2004). Behavioral engagement refers to 
paying attention, completing assigned work, participating in 
teacher-sanctioned learning opportunities, and showing an absence 
of disruptive behaviors. Cognitive engagement refers to a willing- 
ness to exert effort to understand content, work through difficult 
problems, and manage and direct their attention toward the task at 
hand. Emotional engagement refers to feelings of connection to 
content, interest in learning, and enjoyment of solving problems 
and thinking about content (Fredricks et al., 2004). A fourth 
construct, social engagement, is also fundamental. Social engage- 
ment (termed “task-related interaction” by Patrick, Ryan, & Ka- 
plan, 2007) refers to students’ day-to-day social exchanges with 
peers that are tethered to the instructional content. Standards-based 
math instruction emphasizes activities involving small groups of 
students and mathematical discourse among students (Fuson, Kal- 
chman, & Bransford, 2005; NCTM, 2000). 

Although conceptualizations of engagement vary, there are four 
common themes: (a) Engagement is a critical mediator for learn- 
ing; (b) engagement is multifaceted with behavioral, cognitive, 
emotional, and social elements; (c) different sources of data are 
necessary depending on the type of engagement measured; and (d) 
students show a decrease in engagement in learning as they prog- 
ress from elementary school into the middle school years (Furrer & 
Skinner, 2003; Marks, 2000; Reschly & Christenson, 2012; Reyes 
et al., 2012). Engagement theories describe dynamics within the 
engagement system (e.g., emotional engagement stimulates behav- 
ioral engagement) and outside of the engagement system (e.g., 
social context contributes to behavioral engagement; Reschly & 
Christenson, 2012; Skinner et al., 2009). We focus on a factor 
outside of the engagement system, teacher—student interaction 
quality, and consider the extent to which it contributes to behav- 
ioral, cognitive, emotional, and social engagement. 


Teacher-Student Interactions 


Teachers’ interactions with students vary in quality and have 
appreciable effects on math achievement outcomes (Martin, An- 
derson, Bobis, Way, & Vellar, 2012; Reyes et al., 2012). Teacher— 
student interactions are malleable features of classroom environ- 
ments and have been the focus of national efforts to raise 
mathematics achievement (Pianta & Hamre, 2009; Rimm- 
Kaufman & Hamre, 2010). 


Quality of Teacher—Student Interactions 


Teacher—student interaction quality can be described in relation 
to three domains: emotional, organizational, and instructional sup- 
port (Pianta & Hamre, 2009). Emotional support refers to the 
teachers’ connection to and responsiveness toward students, 
awareness of students’ individual differences and needs, and will- 
ingness to incorporate students’ point of view into learning activ- 
ities. Classroom organization refers to the teachers’ tendency to 
use proactive rather than reactive supports to foster classroom 
routines and guide classroom behavior, use instructional ap- 
proaches that make learning objectives clear, and use a variety of 
modalities to engage students in learning. Instructional support 
refers to the presence of feedback loops in teacher—student com- 
munication and provision of opportunities to engage in higher 
order thinking and learn new language and vocabulary (Pianta, La 
Paro, & Hamre, 2008). 

Research links teachers’ emotional support (i.e., positive class- 
room social climate, teacher sensitivity toward students) to en- 
hanced engagement in kindergarten (Rimm-Kaufman et al., 2002) 
and third grade classrooms (NICHD Early Child Care Research 
Network, 2005). Teacher efforts to reinforce prosocial behavior in 
sixth and seventh grade contribute to enhanced behavioral and 
social engagement (Matsumura, Slater, & Crosson, 2008). Meta- 
analytic work has demonstrated associations between positive 
teacher affect and engagement and between negative teacher affect 
and disengaged behavior (Roorda, Koomen, Spilt, & Oort, 2011). 
Engagement plays a mediational role linking emotional support to 
achievement in both upper elementary (Reyes et al., 2012) and 
middle school grades (Voelkl, 1995). 

High quality classroom organization has been linked to engage- 
ment in kindergarteners, first graders (Ponitz, Rimm-Kaufman, 
Brock, & Nathanson, 2009; Rimm-Kaufman, Curby, Grimm, Na- 
thanson, & Brock, 2009), and third graders (NICHD Early Child 
Care Research Network, 2005). Teachers who establish clear rou- 
tines in the fall appear to increase the self-regulated behavior of 
their students throughout the school year (Bohn, Roehrig, & Press- 
ley, 2004; Cameron, Connor, & Morrison, 2005). Third graders in 
classrooms with higher levels of productivity and more opportu- 
nities to engage in academic instruction spend more time behav- 
iorally engaged (NICHD Early Child Care Research Network, 
2005). 

Instructionally rich learning environments are also likely to 
support engagement. The presence of authentic learning experi- 
ences (i.e., provision of interesting questions and opportunities for 
in-depth learning) relates to increased student engagement in math 
learning in elementary and middle school years (Marks, 2000). In 
sixth and seventh grade classrooms, teachers who asked students 
challenging questions and encouraged students to explain the 
evidence behind their statements in classroom discussions en- 
hanced the quality of the discourse (Matsumura et al., 2008), and 
thus, participatory engagement. Middle school teachers who were 
observed showing high expectations for student work, monitoring 
student progress and providing scaffolding, and challenging stu- 
dent thinking produced higher levels of student engagement (Ra- 
phael, Pressley, & Mohan, 2008; Woodward et al., 2012). 


Gender 


Student gender has been linked to engagement; boys show lower 
levels of behavioral and emotional engagement than girls in ele- 
mentary and middle school (Kindermann, 2007; Marks, 2000). 
Most work demonstrating gender differences in engagement gen- 
eralizes across content areas and is not specific to math. The fifth 
grade year appears to be an important time to compare the engage- 
ment of boys and girls in math because it marks an inflection point 
in achievement. From kindergarten to fifth grade, math achieve- 
ment disparities between boys and girls increase, with boys show- 
ing more achievement growth than girls. In middle school, the 
reverse is true, and girls show larger achievement increases than 
boys (Robinson & Lubienski, 2011). Gender disparities in engage- 
ment and achievement warrant further investigation in math class- 
rooms. 

Simply comparing engagement between boys and girls does not 
fully recognize the role of teacher—student interactions in the 
facilitation of engagement. In a study of elementary and middle 
school students, the higher level of engagement in girls than boys 
was attenuated in the presence of social support (i.e., student- 
reported teacher respect, feelings of safety at school, the presence 
of high expectations from their teachers, and opportunities to 
discuss academic issues with their families; Marks, 2000). Other 
work has described boys as having more frequent and academi- 
cally challenging interactions with their teachers than girls (Na- 
tional Research Council, 2001). Too little is known about how 
teachers’ interactions with students are differentially important to 
boys’ versus girls’ concurrent engagement in fifth grade math. 


Informants of Engagement 


We used three categories of informants to assess engagement. 
Observers measured behavioral engagement; teachers reported on 
behavioral engagement; and students reported on their cognitive, 
emotional, and social engagement. Each information source pro- 
vides a unique perspective. Classroom observational methods mea- 
sure observable indicators of engagement, such as behavioral 
engagement (Brophy & Good, 1986; NICHD Early Child Care 
Research Network, 2005) but do not capture students’ internal 
psychological experience. Teachers’ ratings provide global indi- 
cators of students’ engagement and provide cumulative reports of 
engagement over a year but also tap teachers’ subjective beliefs 
about students (Gest, Domitrovich, & Welsh, 2005; Mashburn, 
Hamre, Downer, & Pianta, 2006). Student-report methods provide 
students’ own perspective of their psychological experience and 
may be better for measuring intrapsychic experiences; however, 
student-report methods may be sensitive to social desirability bias. 
We included three informants because each reporter provides a 
unique perspective on students’ engagement. 


Other Factors 


Several student attributes with theoretical or empirical links to 
mathematics engagement were included as covariates. Age was 
included because of its association with self-regulatory abilities in 
school settings (Bronson, 2001). Eligibility for free or reduced- 
price lunch (FRPL) was used as an indicator of low income because 
elementary school-aged children living in impoverished environ- 
ments confront chronic stressors that are linked to 
lower self-regulatory abilities and engagement (Evans & English, 
2002; Evans & Rosenbaum, 2008). Initial achievement was in- 
cluded because of associations between higher math achievement 
and emotional engagement, and because rate of growth in math 
learning differs for students with preexisting academic difficulty 
compared to typical students (Bodovski & Farkas, 2007; Crosnoe 
et al., 2010; Dotterer & Lowe, 2011). Self-efficacy in math, 
defined as students’ perception of their capacity to learn or per- 
form in math, was included because of established links to en- 
gagement (Linnenbrink & Pintrich, 2003; Schunk & Pajares, 
2005). Time of year was included because of changes in student 
experience over the year (Curby, Rimm-Kaufman, & Abry, 2013). 


The Present Study 


We address three questions. First, to what extent do observa- 
tionally based, teacher-reported, and student-reported measures of 
engagement show concordance and discordance? We hypothesized 
stronger associations within informants than among informants. 
We expected that varied approaches to measurement would pro- 
vide different lenses on engagement. Second, to what extent do the 
quality of teacher–student interactions and student gender contrib- 
ute to engagement? We expected higher engagement among girls 
than boys and expected higher quality teacher—student interactions 
to relate to higher engagement. Third, does the quality of teacher— 
student interactions predict student engagement differentially for 
boys and girls? We expected higher quality teacher—student inter- 
actions would be more important for engaging boys than girls. 


Method 


Participants 


All schools were located in a single suburban district in a 
Mid-Atlantic state. Schools and fifth grade teachers were recruited 
by the research team through in-person meetings with principals 
and teachers. Response rates were 83% and 79% for schools and 
teachers, respectively. The selected schools (N = 20) were socio- 
economically and linguistically diverse; 33% of students qualified 
for FRPL, and 31% were English language learners (ELL). Sixty- 
three fifth grade mathematics teachers participated. Teachers had, 
on average, 12.49 years of experience (range = 1-38). Most 
teachers were Caucasian (n = 48); five were Hispanic, one was 
African American, one was Native American, and two were mul- 
tiracial. Six teachers did not report their race/ethnicity. All teachers 
held bachelor’s degrees; 38 had master’s degrees. All teachers 
reported having a full state certification. Teachers received finan- 
cial remuneration for participating. 

Fifth grade students were recruited via mailings sent home to all 
parents by participating teachers in the fall of students’ fifth grade 
year. Family recruitment practices followed customary district 
procedures for family communication, involving translation into 
seven commonly spoken languages. Parents of 479 students signed 
consent forms and received gift certificates for participating. 

Approximately five students per classroom (mode = 5) were 
selected from the 479 consented students, resulting in the sample 
of 387. Selection was conducted randomly for each classroom 
bounded by two constraints: (a) maintenance of equal number of 
girl and boy participants, and (b) demographic match to the whole 
school (based on ethnicity, FRPL, and ELL percentages). The final 
sample of student participants (n = 387; 203 girls) had a mean age of 10.47 
years (SD = 0.37) in September 2010. School records showed 
that 21% of students qualified for FRPL (income of $40,793 for a 
family of four, roughly below 180% of the federal poverty guide- 
line). Parent-report questionnaires (described below) showed that 
55% of students spoke primarily English at home, 28% spoke a 
non-English language (22 different languages reported), and 17% 
had missing data. Of the 321 parents reporting mothers’ education, 
7.2% did not have a high school diploma, 21.5% had a high school 
diploma, and 71.3% had an associate’s degree or above. 
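
The gender-balance constraint in the selection procedure above can be illustrated with a minimal sketch. It assumes a hypothetical list of consented students per classroom, each a (student_id, gender) pair. The function name `select_balanced`, the default of five students, and the seed are illustrative assumptions rather than details from the study, and the second constraint (demographic match to the whole school) is not modeled here.

```python
import random

def select_balanced(consented, per_class=5, seed=0):
    """Randomly pick up to `per_class` students, keeping the numbers of
    girls and boys as close to equal as the consented pool allows.

    `consented` is a list of (student_id, gender) tuples with gender in
    {"girl", "boy"}. Hypothetical helper, not the study's actual procedure.
    """
    rng = random.Random(seed)
    girls = [s for s in consented if s[1] == "girl"]
    boys = [s for s in consented if s[1] == "boy"]
    rng.shuffle(girls)
    rng.shuffle(boys)
    # Randomize which group draws first so the odd slot does not always
    # favor the same gender.
    pools = [girls, boys]
    if rng.random() < 0.5:
        pools.reverse()
    selected = []
    turn = 0
    while len(selected) < per_class and (pools[0] or pools[1]):
        # Alternate between groups; fall back to the other group if one is empty.
        pool = pools[turn % 2] or pools[(turn + 1) % 2]
        selected.append(pool.pop())
        turn += 1
    return selected

# Hypothetical classroom with four consented girls and four consented boys.
classroom = [(i, "girl") for i in range(4)] + [(i + 10, "boy") for i in range(4)]
print(select_balanced(classroom))
```

With an odd target such as five, the sketch yields a three-to-two split, which is the closest to equal that the target permits.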


Procedures 


The research team began by conducting extensive, iterative pilot 
work in 2009-2010 with 33 fifth grade students and six fifth grade 
teachers. Existing, well-validated measures were used to measure 
engagement, when available. The research team reviewed existing 
measures and found that typical engagement measures were not 
necessarily well suited for fifth graders, math, or reflections on one 
specific day of class. Pilot work was conducted to adapt existing 
observational and student-report engagement measures to meet 
high levels of rigor. 

Data were gathered from five sources: (a) district data, (b) parent- 
report questionnaire, (c) classroom observations, (d) student-report 
questionnaires, and (e) teacher-report questionnaires. All data were 
gathered while students were enrolled in fifth grade except for initial 
student achievement data, which was gathered by the district as part 
of the end-of-the-year fourth grade standardized testing. Data were 
gathered between May 2010 and May 2011, as shown in Table 1. 

Data gathered by the district were used to measure student 
FRPL and initial achievement. Parents completed a demographic 
questionnaire at fall recruitment to describe student and sample 
characteristics (e.g., gender, age, mother’s education). Pairs of 
research assistants conducted classroom observations in math 
classes at three times during the school year, corresponding to 
three windows (Window 1: late September to late November; 
Window 2: late November to mid-February; and Window 3: late 
February to late April). At each observation, one research assistant 
videotaped the classroom to gather teacher-student interaction 
quality data and a second research assistant live-coded student 
engagement (as described below). The research assistants admin- 
istered engagement questionnaires to student participants immedi- 
ately after each observation. After the fall observation only, stu- 
dents completed a questionnaire about their math self-efficacy. In 
spring, teachers completed a teacher demographic questionnaire 
and questionnaires about each participant to measure student en- 
gagement in mathematics learning. 

All classroom visits were scheduled and conducted following a 
specific protocol. Research assistants scheduled classroom obser- 
vations of teachers and students on days that teachers deemed 
“typical days” of math instruction. Observations were scheduled 
for 3 different days during the school year to sample typical 
practices from over 3 hr of observation. Classroom observations 
were conducted for the full length of the math lesson (M = 63 min, 
range 15 to 135). One research assistant began videotaping the 
classroom prior to the transition to math instruction and ended at 
the end of the math lesson. The second research assistant (child 
observer) conducted two 4-min observations of engagement for 
each child participant during the same time in which the teacher 
was observed. Child observers followed a protocol in which they 
would watch one student for 4 min, complete ratings, watch a 
second student for 4 min, complete ratings, and so on, until all 
student participants had been observed and rated once. Then, the 
child observer would cycle through the student participants a 
second time, observing each student for 4 min and completing 
ratings again. Child observations resulted in 24 min of observed 
engagement per child across the three visits. Child observers made 
efforts so that students were unaware that they were being observed. 
When the math lesson observation was complete, the research assistants 
distributed student-report engagement questionnaires to student 
participants to measure their engagement in mathematics on that specific 
day. All classroom videotapes were sent to the laboratory for 
subsequent coding. 


Table 1 
Timeline for Data Collection From Five Data Sources 

[Table body not recoverable; surviving row labels: District data (student demographic information); Parent-report questionnaire; Classroom observations (teacher-student interaction quality, observed engagement); Student-report questionnaires.] 


Measures 


District data. 

Student demographic data. District records were used to determine 
eligibility for FRPL. 

Initial achievement. The paper version of the state standardized 
test, the Standards of Learning (SOL), was administered by the 
district to assess fourth grade mathematics achievement (Virginia 
Department of Education [VDOE], 2008). The test was composed 
of 50 multiple choice items tapping students’ procedural knowledge 
and conceptual understanding of four skill categories: (a) 
number and number sense, (b) computation and estimation, (c) 
measurement and geometry, and (d) probability, statistics, patterns, 
functions and algebra (VDOE, 2010). The state computed 
the total number of items correct and converted the number to a 
scaled score ranging from 0 to 600. A scaled score of 400 indicates 
pass/proficient, and 500 indicates pass/advanced. The Virginia 
Standards of Learning Technical Report (VDOE, 2008) describes 
test development, calibration, and validity. Test items were developed 
through a collaborative process among Virginia educators, 
VDOE, Educational Testing Service, Pearson, and content experts 
based upon test blueprints. Calibration was established using Ra- 
sch modeling and the partial credit model. Test validity was 
established by gathering empirical evidence supporting the face 
validity, intrinsic rational validity, content validity, and construct 
validity (VDOE, 2008). Students deemed not proficient in English 
were administered the Plain English Math version that equates to 
the standard math assessment. 

Parent-report questionnaires. 

Student demographic information. Parents completed cus- 
tomized questionnaires to describe sociodemographic characteris- 
tics. Gender was coded as 1 for female. Child age in September 
2010 was computed based on birthdate. Parents reported primary 
language spoken at home and level of maternal education. 

Classroom observations. 

Teacher-student interaction quality. Quality of teacher- 
student interactions was assessed using the Classroom Assessment 
Scoring System (CLASS; Pianta et al., 2008). There are 10 mea- 
sured dimensions that correspond to three domains (emotional 
support, classroom organization, and instructional support). Emo- 
tional support was measured using dimensions of positive climate, 
negative climate, teacher sensitivity, and regard for student 
perspectives (α = .83). Positive climate referred to a positive 
emotional tone among teachers and students, including respect, 
enthusiasm, and evidence of enjoyment. Negative climate (re- 
versed for analysis) tapped teachers’ evidence of sarcasm, anger, 
aggression and/or harshness. Teacher sensitivity measured evi- 
dence of the teacher providing comfort, reassurance and encour- 
agement in relation to students’ academic and social needs. Regard 
for students’ perspectives referred to situations in which teachers’ 
choice of classroom activities demonstrated emphasis on students’ 
motivation, interests, and point of view. 

Classroom organization was assessed using scales of behavior 
management, productivity, and instructional learning formats (α = 
.82). Behavior management measured teachers’ use of effective 
methods to prevent students’ behavior problems and redirect stu- 
dents, as needed. Productivity referred to the teachers’ use of 
instructional time and routines enabling appropriate learning op- 
portunities for students. Instructional learning formats referred to 
teachers’ use of materials and activities to facilitate learning op- 
portunities. 

Raters assessed three dimensions in relation to instructional 
support for learning (α = .72). Concept development measured the 
teachers’ use of strategies to promote students’ higher order think- 
ing. Quality of feedback assessed specificity of teachers’ verbal 
interaction pertaining to student work, ideas and comments (e.g., 
did teacher comments create communication loops between the 
teacher and students). Language modeling measured the extent to 
which teachers facilitated, encouraged and modeled students’ use 
of advanced language. Each dimension was rated on a 7-point 
Likert scale. For analyses predicting observed and student-reported 
engagement, CLASS domains collected concurrently with the en- 
gagement outcome were used in models. For analyses predicting 
teacher-reported student engagement, mean levels of domains 
were calculated for each teacher across the three observation 
windows. 

Prior to CLASS training, coders read manuals and additional 
readings and conducted practice observations. Training involved 
the time equivalent of a 2-day small group interactive training 


followed by paired observations with an expert. Reliability tests 
involved rating ten 15-min segments for CLASS. Ratings were 
compared to a gold standard, prepared by the instruments’ authors. 
To be considered reliable, each coder’s responses had to be within 
1 scale point of the gold standard on 80% of the responses. 
Reliability exceeded these levels prior to data collection. Calibra- 
tion involved independent coding (once or twice per month) in a 
small group session followed by reliability checks and discussion 
of coding rationales plus double coding of more than 10% of tapes 
selected randomly. In addition, master coders conducted audits by 
coding one tape coded by each coder every 12 weeks. Measures of 
teacher-student interaction quality were based on two segments 
(minutes 0-15 and 30-45). 
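
The reliability criterion described above (ratings within 1 scale point of the gold standard on 80% of responses) can be sketched as follows. The function names and the example ratings are hypothetical and serve only to illustrate the rule; they are not taken from the study's training materials.

```python
def within_one_agreement(coder, gold):
    """Proportion of ratings that fall within 1 scale point of the gold standard."""
    if len(coder) != len(gold) or not coder:
        raise ValueError("ratings must be non-empty and of equal length")
    hits = sum(1 for c, g in zip(coder, gold) if abs(c - g) <= 1)
    return hits / len(coder)

def is_reliable(coder, gold, threshold=0.80):
    """Apply the 80% within-one criterion described in the text."""
    return within_one_agreement(coder, gold) >= threshold

# Hypothetical ratings on the 7-point scale for ten responses.
gold = [5, 6, 2, 4, 5, 6, 3, 3, 4, 5]
coder = [5, 5, 2, 6, 4, 6, 3, 1, 4, 5]
print(within_one_agreement(coder, gold))  # 0.8
print(is_reliable(coder, gold))           # True
```

In this example two of the ten ratings differ from the gold standard by 2 points, so the coder sits exactly at the 80% threshold.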

Observed behavioral engagement. Research assistants as- 
sessed student engagement using time-sampling and global rating 
systems adapted from the NICHD Early Childcare Research Net- 
work (2005) Classroom Observation Scale (COS). The COS had 
been used in fifth grade classrooms (Pianta et al., 2008) but 
required honing, revised documentation, and testing, all of which 
were conducted during the pilot year. The time sampling measure 
was a low-inference measure and required an observer to note the 
presence or absence of disengagement in 1-min intervals. Disen- 
gagement included wandering, looking away from instructional 
opportunities, behaviors disruptive to learning, and similar behav- 
iors listed in a manual. Students were observed and coded for four 
consecutive 1-min intervals, twice during each math lesson. Ob- 
served on-task behavior was calculated as minutes observed minus 
minutes of disengagement. 
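
As a small illustration of the scoring rule just described, the sketch below computes on-task minutes from interval-level codes. It assumes one boolean flag per 1-min interval (True when disengagement was noted); the function name and the example cycle are hypothetical.

```python
def on_task_minutes(disengaged_flags):
    """Minutes observed minus minutes in which disengagement was noted.

    `disengaged_flags` holds one boolean per 1-min interval
    (True = disengagement was present in that interval).
    """
    minutes_observed = len(disengaged_flags)
    minutes_disengaged = sum(bool(flag) for flag in disengaged_flags)
    return minutes_observed - minutes_disengaged

# Hypothetical 4-min observation cycle: disengagement noted in minute 3 only.
cycle = [False, False, True, False]
print(on_task_minutes(cycle))  # 3
```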

The global rating was a high-inference measure composed of 
three rating scales: (a) participation in learning opportunities (e.g., 
duration and interest of involvement), (b) disruptive behavior 
(reversed; e.g., excessive out-of-turn talking, sustained noise), and 
(c) self-reliance (e.g., self-management of materials and responsi- 
bilities). Research assistants took notes related to global codes 
during the 4-min time-sampling period. Then, the research assis- 
tant used the notes and a scoring rubric to rate behavior from 1 
(low) to 7 (high) on each scale. 

Reliability training involved following a protocol with a four- 
phase process (preparation, training, reliability, and ongoing cali- 
bration) to attain and maintain reliability. The process was com- 
parable to that described for the CLASS. Reliability values prior to 
data collection (based on eight segments) and during monthly 
calibration (based on eight segments), respectively, showed an 
intraclass correlation of .65 and .95 for the time sampling measure 
and 75% and 90% within one match for the global rating. Initial 
values were lower than desired, but later tests of reliability indi- 
cated substantial improvements. Research assistants conducted 
paired coding during initial visits until reliability improved. The 
time sampling score and global ratings of behavioral engagement 
were correlated (r = .70). All four scores were included in a 
confirmatory factor analysis to create a factor score. 

Student-report questionnaires. 

Students’ feelings of math efficacy. The Academic Efficacy 
subscale of the Patterns of Adaptive Learning Scale (Midgley et 
al., 2000) was used to measure students’ perception of their com- 
petence. The subscale was modified to apply to a math context and 
piloted and validated in a sample of 39 students (pilot α = .89). 
Students rated items such as, “I’m certain I can master the skills 
taught in math this year,” and “I can do almost all of the work in 
math if I work at it” on a scale from 1 (almost never) to 4 (all the 
time). The five items were averaged to create a composite value of 
math efficacy where higher values indicated higher efficacy (α = 
.81). 

Student-reported engagement in math class. The student- 
reported measures of cognitive and emotional engagement were 
developed based upon measures created by Meece (2009); Kong, 
Wong, and Lam (2003); Rowley, Kurtz-Costes, Meyer, and Kizzie 
(2009); and Skinner and Belmont (1993) to assess students’ report 
of cognitive and emotional engagement in relation to a specific day 
and the math context. Development and piloting of student-report 
measures of cognitive and emotional engagement involved selec- 
tion of items by two mathematics education experts, adjustment of 
wording to measure a single day in math, and review by a research 
team. This process was followed by piloting the measure with 
students immediately after their math instruction. Researchers 
observed the students and then distributed measures. Students 
rated their engagement and identified confusing items. Research 
assistants engaged in conversations about cognitive and emotional 
engagement with selected students for content validation. This 
process was conducted serially following protocols. 

The student-report measure of social engagement developed and 
used by Patrick et al. (2007) was adopted in its existing form (with 
the only modification involving addition of the phrase “in math 
class”). The measure of social engagement was composed of five 
items that measured the extent to which students explained aca- 
demic content to one another and discussed ideas with other 
students in class. 

Students reported on engagement in a 15-item questionnaire 
with a scale from 1 (no, not at all true) to 4 (yes, very true). The 
questionnaire was piloted in a sample of 33 fifth graders in three 
schools prior to use, resulting in an alpha for the complete measure 
of .90 and correlations exceeding .50 (p values < .01), with 
analogous measures given to teachers simultaneously. Using data 
from the present study (n = 387), a confirmatory factor analysis 
resulted in a three-factor solution representing subconstructs of 
cognitive engagement, emotional engagement, and social engage- 
ment, with alphas of .78, .91, and .74, respectively. The alpha 
values for cognitive and social engagement were not as high as 
desired; however, we chose to use these factors in subsequent 
analyses because of solid factor loadings and high fit indices (see 
Table 3). Factor scores for each subconstruct were used in analy- 
ses. Further validation stems from relatedness of the three subcon- 
structs to other constructs. Factor scores for cognitive, emotional, 
and social engagement correlated .56, .67, and .49, respectively, 
with student-reported feelings about school. 
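
For readers who want to see how an internal-consistency value of this kind is obtained, the following is a minimal sketch, assuming the reported alphas are Cronbach's alpha. The function name and the input layout (one list of scores per item, aligned by respondent) are assumptions made for illustration.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for `items`, a list of per-item score lists
    (one inner list per item, aligned by respondent)."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(scores) for scores in items)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical responses from five students to three items on a 1-4 scale.
responses = [
    [4, 3, 2, 4, 1],
    [4, 3, 3, 4, 2],
    [3, 3, 2, 4, 1],
]
print(round(cronbach_alpha(responses), 2))
```

The same variance definition is used for the items and for the total score, so the ratio inside the formula does not depend on whether population or sample variances are chosen.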

Teacher-report questionnaires. 

Teacher-reported engagement. Teachers reported on behav- 
ioral engagement in math using an eight-item version of the 
student engagement questionnaire used by Wu, Hughes, and Kwok 
(2010) and Skinner, Furrer, Marchand, and Kindermann (2008), 
and adapted to include the phrase, “in math class.” Teachers rated 
each item from 1 (strongly disagree) to 4 (strongly agree). Items 
included “This student pays attention in math class” and “This 
student participates in discussion in math class.” Factor analysis 
confirmed a one-factor solution. This is consistent with previous 
use of the measure as a single scale and is supported by a high 
internal consistency-reliability estimate (α = .92). This factor 
score correlated significantly with fourth grade achievement (r = 


.29, p < .01) and showed predictive validity to fifth grade achieve- 
ment based on other analyses conducted as part of this study. 

Teacher demographic questionnaire. Teachers reported gen- 
der, ethnicity, education, years of experience, certification, and 
other demographic characteristics in a questionnaire. 

Analytic approach. The initial step involved the reduction 
of engagement data. We conducted separate confirmatory factor 
analyses (CFA) for each measure of engagement using Mplus 
6.12 (Muthén & Muthén, 2010). The CFA utilized a priori 
decisions about factors drawn from theory, previous research, 
and item source (Kong et al., 2003; Meece, 2009; Rowley et al., 
2009; Skinner & Belmont, 1993). Observed behavioral engagement 
and teacher-reported engagement were hypothesized to 
have only one factor based upon theory and study design, 
whereas student-reported engagement was hypothesized to have 
three factors. Following each CFA, a factor score was generated 
for use in analyses. 

Observed behavioral engagement and student-reported en- 
gagement were collected three times per year, whereas the 
teacher-reported engagement measure was collected only once. 
For Question 1 only, we aggregated observed behavioral and 
student-reported engagement measures across the three obser- 
vations, resulting in one value per student for each of the 
following: observed behavioral engagement, self-reported cog- 
nitive engagement, emotional engagement, and social engage- 
ment. Descriptive statistics were calculated to understand basic 
data patterns. 
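
The aggregation step for Question 1 can be sketched as below. The text states only that scores were aggregated across the three observations; averaging is an assumption made here, and the function name and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_student(records):
    """Average each student's scores across observation windows.

    `records` is an iterable of (student_id, window, score) tuples;
    a missed window is simply absent. Returns {student_id: mean score}.
    """
    scores = defaultdict(list)
    for student_id, _window, score in records:
        scores[student_id].append(score)
    return {sid: mean(vals) for sid, vals in scores.items()}

# Hypothetical scores: student "s2" was not observed in Window 2.
records = [
    ("s1", 1, 3.0), ("s1", 2, 3.5), ("s1", 3, 4.0),
    ("s2", 1, 2.0), ("s2", 3, 3.0),
]
print(aggregate_by_student(records))  # {'s1': 3.5, 's2': 2.5}
```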

Question 1 examined concordance and discordance between 
observationally based, teacher-reported, and student-reported 
math engagement. Bivariate correlation coefficients were com- 
puted and examined. Questions 2 and 3 involved multilevel 
modeling to account for clustering effects (observations nested 
within students, students nested within teachers, and teachers 
nested within schools) using PROC MIXED in SAS (Version 
9.2). Questions 2 and 3 used the five dependent variables 
(created from the abovementioned CFA): (a) observed behav- 
ioral engagement, (b) teacher-reported behavioral engagement, 
(c) student-reported cognitive engagement, (d) student-reported 
emotional engagement, and (e) student-reported social engage- 
ment. Both observed and student-reported engagement data 
were gathered at three time points. Instead of aggregating the 
data across time for these outcomes, we handled the longitudi- 
nal nature of the data via random effects in SAS PROC MIXED. 

Model assumptions. Multilevel modeling assumes normal- 
ity of the residuals, linear relationships between variables, no 
outliers, and an appropriate method for handling missing data 
(Kline, 2011; Little & Rubin, 1987). Data were examined 
through residuals plots, histograms, and scatterplots. Assump- 
tions were met for normality of the residuals and linear rela- 
tionships between variables. No outliers were apparent. 
Roughly 5% of the data were missing for all the covariates. 
Analyses were conducted to determine the type of missing data 
via bivariate correlations and logistic regression. Considering 
the exhaustive nature of the covariates and the fact that missing 
data analyses revealed no systematic trends, the data were 
determined to be most likely missing at random. Subsequently, 
data were imputed in Mplus (Muthén & Muthén, 2010) while 
accounting for the clustering of the data. 

Multilevel model building. Models were built incrementally 
in three steps. First, we created a basic four-level model that 
included gender and all covariates. Second, to address Question 2, 
we added one teacher-student interaction quality variable at a time 
to the basic model. Finally, to address Question 3, we generated 
models with one teacher-student interaction quality variable (e.g., 
emotional support) plus its corresponding interaction with gender 
(e.g., Gender × Emotional support). The decision to include one 
teacher-student interaction variable at a time versus all three 
teacher-student interaction variables simultaneously involved 
additional consideration. Next, we describe the model building 
process and then we describe the decision about how we handled 
teacher-student interaction variables. 

The basic model involved two observation-level variables 
(weeks in the year, weeks in the year squared), five student- 
level variables (gender, age, FRPL, initial achievement, self- 
efficacy), two teacher-level variables (master’s degree, years of 
experience), and no school-level variables. To address Question 
2, examining gender and teacher-student interaction quality 
main effects, we added one teacher-level variable (emotional 
support, classroom organization, or instructional support) to the 
basic model. To address Question 3, examining gender by 
teacher-student interactions, we used the basic model plus a 
cross-level interaction between gender and one domain of 
teacher-student interaction quality (Gender × Emotional support, 
Gender × Classroom organization, or Gender × Instructional 
support). Each model included all the predictors that were 
part of the basic model. Centering was unnecessary because 
gender was binary at the student level. We conducted post hoc 
contrasts to test significance of the slopes for boys and girls 
separately. 
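
The three-step sequence of fixed-effect specifications can be laid out as in the following sketch. The variable names, the formula-string notation, and the labels "Model 3A" to "Model 3C" are hypothetical conveniences; the random effects for student, teacher, and school are not represented, and the analyses themselves were run in SAS PROC MIXED rather than in Python.

```python
BASIC_TERMS = [
    "weeks", "weeks_sq",                      # observation level
    "gender", "age", "frpl",                  # student level
    "initial_achievement", "self_efficacy",
    "masters_degree", "years_experience",     # teacher level
]
DOMAINS = {
    "A": "emotional_support",
    "B": "classroom_organization",
    "C": "instructional_support",
}

def build_model_specs(outcome):
    """Return {model label: fixed-effects formula string} for the three steps."""
    specs = {"Model 1": f"{outcome} ~ " + " + ".join(BASIC_TERMS)}
    for suffix, domain in DOMAINS.items():
        # Step 2: basic model plus one interaction-quality domain.
        main = BASIC_TERMS + [domain]
        specs[f"Model 2{suffix}"] = f"{outcome} ~ " + " + ".join(main)
        # Step 3: add the Gender x Domain cross-level interaction.
        inter = main + [f"gender:{domain}"]
        specs[f"Model 3{suffix}"] = f"{outcome} ~ " + " + ".join(inter)
    return specs

for label, formula in build_model_specs("observed_engagement").items():
    print(label, "|", formula)
```

Each outcome therefore yields seven specifications: one basic model, three main-effect models, and three interaction models.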

This approach to model building was the same for four of five 
engagement outcomes. Observed behavioral engagement and 
the three student-reported engagement constructs were mea- 
sured at three points during the year. Thus, we tested change 
over time as a linear term (weeks from the start of school) and 
curvilinear term (weeks from the start of school squared) to 
estimate slight curvature. Teacher-reported behavioral engage- 
ment was measured once. For that outcome, we used three-level 
models that excluded the observation level. 

The data measured longitudinally were analyzed using a 
repeated measure analysis with random effects in SAS PROC 
MIXED. This procedure allows time to be a within-subject 
factor because different measurements on the same student are 
at different times (which we refer to as observation level). Time 
was tested as a main effect in the model; in other words, the 
model was designed to answer the question of how a student 
changes as time progresses (with potential linear and curvilin- 
ear effects). The student was entered as a random effect 
permitting inferences to be made to the entire population of 
students who could have been in the study. The clustering 
effects of classrooms and schools were handled similarly as 
random effects. The multilevel models all had the same general 
form. 


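$$
\begin{aligned}
\text{Level 1: } & \text{Observed Behavioral Engagement}_{tijk} = \pi_{0ijk} + \pi_{1ijk}(\text{Weeks}) + \pi_{2ijk}(\text{Weeks})^{2} + e_{tijk} \\
\text{Level 2: } & \pi_{0ijk} = \beta_{00ij} + \beta_{10ij}(\text{Gender}) + \beta_{20ij}(\text{Age}) + \beta_{30ij}(\text{FRPL}) + \beta_{40ij}(\text{Initial Achievement}) + \beta_{50ij}(\text{Self-Efficacy}) + r_{ijk} \\
\text{Level 3: } & \beta_{00ij} = \gamma_{000i} + \gamma_{100i}(\text{Master's Degree}) + \gamma_{200i}(\text{Years of Experience}) + \gamma_{300i}(\text{Teacher-Student Interaction Quality Domain}) + u_{ij} \\
\text{Level 4: } & \gamma_{000i} = \eta_{0000} + \varepsilon_{i}
\end{aligned}
$$

where 
i = 1, ..., 20 (schools) 
j = 1, ..., 63 (classrooms) 
k = 1, ..., 387 (students) 
t = 1, ..., 3 (time points). 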

Levels 1, 2, 3, and 4 correspond to time (observation), student, 
teacher, and school levels, respectively. Weeks and Weeks² were 
set as fixed effects. Level 1 was omitted for the teacher-reported 
behavioral engagement model. 
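
Substituting each level into the one above it gives a single combined equation, shown here as a sketch. It assumes that all slopes are fixed (the text states this for Weeks and Weeks²; for the remaining predictors it is an assumption), so their subscripts are dropped, and it writes the outcome as Y and the grand intercept as η; both symbols are notational assumptions.

$$
\begin{aligned}
Y_{tijk} ={} & \eta_{0000} + \pi_{1}(\text{Weeks}) + \pi_{2}(\text{Weeks})^{2} \\
& + \beta_{10}(\text{Gender}) + \beta_{20}(\text{Age}) + \beta_{30}(\text{FRPL}) + \beta_{40}(\text{Initial Achievement}) + \beta_{50}(\text{Self-Efficacy}) \\
& + \gamma_{100}(\text{Master's Degree}) + \gamma_{200}(\text{Years of Experience}) + \gamma_{300}(\text{Teacher-Student Interaction Quality Domain}) \\
& + \varepsilon_{i} + u_{ij} + r_{ijk} + e_{tijk}
\end{aligned}
$$

The four trailing terms are the school, teacher, student, and observation-level residuals, respectively; for the teacher-reported outcome the Weeks terms and the observation-level residual drop out.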

Handling of correlated CLASS domains. Correlation coef- 
ficients showed associations among the CLASS domains (emo- 
tional support, classroom organization, instructional support) with 
coefficients ranging from .58 to .62. Including all three domains in 
the same model raises multicollinearity concerns. However, ana- 
lyzing each domain separately means that model results for each 
domain also contain information about the portion of variance 
shared across domains. Resolution involved a two-part approach: 
First, we analyzed each domain alone in separate models (keeping 
all covariates the same); second, we computed each model with all 
three domains entered simultaneously. We compared results and 
considered trade-offs. Results for emotional support and classroom 
organization were comparable regardless of analytic approach. 
However, in the model with all domains entered simultaneously, 
instructional support was negatively associated with each outcome. 
The negative association contradicted theory, hypotheses, and the 
positive relations evident in the zero order correlations. As a result, 
we decided to report results from models that included each 
CLASS domain separately, in keeping with work elsewhere 
(Avant, Gazelle, & Faldowski, 2011; Rudasill, Gallagher, & 
White, 2010). 
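
One common way to gauge the overlap among three correlated predictors is the variance inflation factor (VIF), which for standardized predictors equals the corresponding diagonal element of the inverse correlation matrix. The sketch below is illustrative only: the source does not report VIFs, and the three pairwise values used are hypothetical picks from within the reported .58 to .62 range.

```python
def vif_three(r12, r13, r23):
    """Variance inflation factors for three predictors, computed as the
    diagonal of the inverse of their 3x3 correlation matrix."""
    det = 1 + 2 * r12 * r13 * r23 - r12**2 - r13**2 - r23**2
    if det <= 0:
        raise ValueError("correlation matrix is not positive definite")
    # Each diagonal cofactor is 1 minus the squared correlation of the other pair.
    return ((1 - r23**2) / det, (1 - r13**2) / det, (1 - r12**2) / det)

# Hypothetical pairwise correlations; the source does not say which
# pair of CLASS domains has which value.
print([round(v, 2) for v in vif_three(0.58, 0.60, 0.62)])
```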


Results 


Factor Analysis 


The CFA for observed behavioral engagement (based on time- 
sampled frequency of engagement and global engagement ratings) 
revealed an excellent model fit for the hypothesized one-factor 
solution (CFI = .99, TLI = .97, RMSEA = .09, SRMR = .02). 
Table 2 shows factor loadings for observer-reported engagement. 
We hypothesized a one-factor solution for teacher-reported behav- 
ioral engagement. The CFA was conducted and the resulting fit 
statistics were excellent (CFI = .95, TLI = .93, RMSEA = .04, 
SRMR = .04). Results are shown in Table 3. Three types of 
student engagement were hypothesized for the student-report measure: 
cognitive, emotional, and social engagement. For cognitive 
engagement, two items (“I thought about other things instead of 
math in math class today” and “Today I only paid attention in math 
when it was interesting”) had relatively weak loadings compared 
to other items but were included because all factor loadings were 
significant (p < .01) and the model fit was excellent (CFI = .96, 
TLI = .96, RMSEA = .03, SRMR = .04). For emotional engagement 
and social engagement, the factor loadings were all relatively 
strong, as shown in Table 4. 


Table 2 
Confirmatory Factor Analysis for Observer-Reported 
Behavioral Engagement 

Item                                                                  Standardized factor loading 
Observed on-task behavior (based on frequency of engaged behavior)    0.81 
Participation in learning opportunities (based on global rating)      0.92 
Disruptive behavior (based on global rating)                          -0.56 
Self-reliance (based on global rating)                                0.90 

Note. CFI = .99, TLI = .97, RMSEA = .09, SRMR = .02. 


Descriptive Statistics and Correlations 


Mean values suggest that, on average, students were highly 
engaged in math class regardless of informant. Students were 
observed as behaviorally engaged 3.28 min of the 4.0 min observed 
and were rated by observers as 5.66 on behavioral engagement 
on a 1-7 scale (based on values obtained prior to the CFA). 
On average, student-reported engagement ranged from 3.01 to 3.42 
and teacher-reported student engagement was 3.05 on 1- to 4-point 
scales, indicating high engagement in learning. Table 5 shows 
descriptive statistics based on factor scores. 


Concordance and Discordance Between Measures of 
Engagement 


Addressing Question 1, correlation coefficients between pairs of 
engagement variables showed (a) all correlations were positive, 
statistically significant, and ranged from .11 to .68; (b) correlation 
coefficients were higher within each measure (range from .49 to 
.68) than between measures (range from .08 to .24); (c) lowest 
correlation values were between student-reported and teacher- 
reported engagement values (range from .11 to .24); and (d) 
consistent with the CFA results, correlation values confirm that 
students’ views of their cognitive, emotional, and social engagement 
are related (r = .49 to .68) but have distinct characteristics 
(see Table 5). 


Table 3 
Confirmatory Factor Analysis for Teacher-Reported 
Behavioral Engagement 

Item                                                                    Standardized factor loading 
This student concentrates on doing his/her work during math class.     0.89 
This student works as hard as he/she can during math class.            0.90 
This student pays attention in math class.                              0.89 
This student tries to learn as much as he/she can about math.          0.89 
This student’s attention seems to wander during math class (reversed). 0.52 
This student participates in discussions in math class.                 0.72 
This student asks off-topic questions during math class (reversed).    0.48 
This student doesn’t try very hard in math class (reversed).           0.79 

Note. CFI = .95, TLI = .93, RMSEA = .04, SRMR = .04. 


Table 4 
Confirmatory Factor Analysis for Student-Reported Engagement 

Item                                                                    Standardized factor loading 
Cognitive engagement 
  Today in math class I worked as hard as I could.                      0.58 
  I thought about other things instead of math in math class today.    -0.33 
  Today I only paid attention in math when it was interesting.         0.31 
  Today it was important to me that I understood the math really well. 0.68 
  I tried to learn as much as I could in math class today.             0.75 
  I did a lot of thinking in math class today.                          0.64 
Emotional engagement 
  Math class was fun today.                                             0.80 
  Today I felt bored in math class.                                     -0.63 
  I enjoyed thinking about math today.                                  0.81 
  Learning math was interesting to me today.                            0.82 
  I liked the feeling of solving problems in math today.               0.70 
Social engagement 
  Today I talked about math to other kids in class.                     0.59 
  Today I helped other kids with math when they didn’t know what to do. 0.72 
  Today I shared ideas and materials with other kids in math class.    0.65 
  Students in my math class helped each other learn today.             0.59 

Note. CFI = .96, TLI = .96, RMSEA = .03, SRMR = .04. 


Main Effect of Teacher-Student Interaction Quality 
and Student Gender on Engagement 


Question 2 examined the extent to which quality of teacher-student 
interactions and student gender contributed to engagement. 
As a preliminary step, descriptive statistics for covariates and 
teacher-student interaction quality, as well as their correlation with 
engagement, were computed (see Table 6). Mean levels of emo- 
tional support and classroom organization appeared higher than 
those for instructional support. Teachers showed a slightly more 
limited range of emotional support and classroom organization 
compared to instructional support. 

To address Question 2, we used the basic four-level model 
(labeled as Model 1 in Tables 7 and 8). We added one teacher-level 
variable (concurrent emotional support [Model 2A], classroom 
organization [Model 2B], or instructional support [Model 2C]) to 
the basic model. Table 7 shows results of the multilevel models for 
observed and teacher-reported behavioral engagement. Students in 
classrooms with higher levels of classroom organization appeared 
more behaviorally engaged (b = .13, p < .01) than students in 
classrooms with lower levels of organization. Girls were observed 
to be more behaviorally engaged than boys (b = .19, p < .01). 
Pertaining to teacher-reported behavioral engagement, the quality 
of teacher-student interactions did not relate to teachers’ report of 
students’ engaged behavior. Teachers rated girls as more engaged 
than boys (b = .11, p < .10). 


Table 5 
Intercorrelations and Descriptive Statistics for Engagement Variables 

Variable                                     1                  2                  3            4            5 
1. Observed behavioral engagement            - 
2. Teacher-reported behavioral engagement    .23** (356)        - 
3. Student-reported cognitive engagement     [illegible] (384)  [illegible] (356)  - 
4. Student-reported emotional engagement     .16** (384)        .11* (356)         .68** (384)  - 
5. Student-reported social engagement        .18** (384)        .24** (356)        .57** (384)  .49** (384)  - 
M                                            0.00               3.05               3.42         3.30         3.01 
SD                                           0.83               0.64               0.43         0.62         0.60 
Min                                          -0.94              1.00               1.50         1.30         1.38 
Max                                          3.04               4.00               4.00         4.00         4.00 
N                                            384                359                384          384          384 

Note. Sample sizes appear in parentheses. 
* p < .05. ** p < .01. 

Table 8 shows results of the multilevel models for student- 
reported engagement. A consistent pattern of findings emerged: 
Students in classrooms with teachers who provided more emo- 
tional support and higher quality classroom organization reported 
higher cognitive engagement (b = .03, p < .01; b = .03, p < .01, 
respectively), higher emotional engagement (b = .03, p < .01; b = 
.03, p < .01, respectively), and higher social engagement (b = .03, 
p < .01; b = .03, p < .01, respectively). Girls reported higher 
cognitive and social engagement than boys (b = .07, p < .01; b = 
.14, p < .01, respectively). 
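
As a reading aid, the main-effects models (Models 2A, 2B, and 2C) can be written in composite form roughly as follows. This is only a sketch inferred from the predictors listed in Tables 7 and 8. The subscripts (t = observation occasion, i = student, j = teacher, k = school), the variable names, and the random-intercept-only structure are assumptions, not a specification stated in the text.

```latex
\begin{aligned}
Y_{tijk} ={}& \gamma_0 + \gamma_1\,\mathrm{Week}_{tijk} + \gamma_2\,\mathrm{Week}^2_{tijk} \\
&+ \gamma_3\,\mathrm{Female}_{ijk} + \gamma_4\,\mathrm{Age}_{ijk} + \gamma_5\,\mathrm{FRPL}_{ijk}
 + \gamma_6\,\mathrm{Achievement}_{ijk} + \gamma_7\,\mathrm{SelfEfficacy}_{ijk} \\
&+ \gamma_8\,\mathrm{Masters}_{jk} + \gamma_9\,\mathrm{YearsExp}_{jk} + \gamma_{10}\,\mathrm{Quality}_{jk} \\
&+ v_k + u_{jk} + r_{ijk} + e_{tijk}
\end{aligned}
```

Here Quality stands for whichever single CLASS domain is entered (emotional support, classroom organization, or instructional support), and the last four terms are the assumed school, teacher, student, and occasion residuals. For teacher-reported behavioral engagement, which was collected once, the occasion level and the week terms would drop out.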


Statistical Interactions Between Gender and 
Teacher-Student Interaction Quality 


Question 3 queried the statistical interaction between quality of 
teacher—student interactions and gender in predicting engagement. 
The multilevel models included the basic model; the main effects 
(tested in Question 2, i.e., concurrent emotional support [Model 
2A], classroom organization [Model 2B], or instructional support 
[Model 2C]); and one of three statistical interactions (Gender × 
Emotional support [Model 3A], Gender × Classroom organization 
[Model 3B], or Gender × Instructional support [Model 3C]). As 
shown in Table 7, none of the interactions between gender and 
CLASS domains were statistically significant for observed or 
teacher-reported behavioral engagement. However, one Gender × 
Teacher–student interaction quality effect emerged for each of the 
student-reported engagement outcomes, as shown in Models 3A, 
3B, and 3C in Table 8. Analyses showed a small interaction effect 
between gender and classroom organization for student-reported 
cognitive engagement (b = —0.06, p < .01). As classroom orga- 


Table 6 
Correlations and Descriptive Statistics of Covariates With Engagement 

                                      Child         Child    FRPL     Initial
Engagement                            gender        age               achievement
Observed factor score (behavioral)    .21           −.06     .02      .20**
Teacher-reported (behavioral)         [illegible]   −.08     .03      .20**
Student-reported (cognitive)          .14**         −.06     .06      [illegible]
Student-reported (emotional)          .05           −.02     .16      .03
Student-reported (social)             [illegible]   .07      .04      [illegible]
M                                     0.53          10.47    0.22     506.38
SD                                    0.50          0.38     0.41     70.68
Min                                   0.00          8.24     0.00     284.00
Max                                   1.00          [illegible] 1.00  600.00
N                                     386           315      386      382

                                      Self-         Master’s      Years         CLASS         CLASS         CLASS
Engagement                            efficacy      degree        exp.          ES            CO            IS
Observed factor score (behavioral)    .08           .03           [illegible]   [illegible]   [illegible]   .08
Teacher-reported (behavioral)         [illegible]   −.03          .01           .07           [illegible]   .03
Student-reported (cognitive)          [illegible]   .06**         [illegible]   .10*          .04           .00
Student-reported (emotional)          .24**         −.03          [illegible]   .05           .00           .02
Student-reported (social)             .40**         .06**         [illegible]   .09           .03           −.05
M                                     3.30          0.63          12.34         5.16          5.99          [illegible]
SD                                    0.57          0.47          8.54          0.55          0.41          0.62
Min                                   1.40          0.00          1.00          3.83          4.39          1.83
Max                                   4.00          1.00          35.00         6.67          6.67          [illegible]
N                                     379           59            59            59            59            59

Note. Student gender (0 = male, 1 = female); FRPL (0 = no, 1 = yes). ES = Emotional support; CO = Classroom organization; IS = Instructional 
support. Sample size for correlations ranged from 300 to 382. 
† p < .10. * p < .05. ** p < .01. 
Table 7 
Multilevel Model Results for Observed and Teacher-Reported Behavioral Engagement 

                                             Observed                    Teacher-reported
Measure                                      b         SE      B         b             SE      B
Observation level (Model 1)
  Weeks of the year                          0.00      0.01    0.01
  Weeks of the year²                         0.00      0.00    0.00
Student level
  Child gender (female)                      0.19**    0.05    0.22      0.11†         0.06    0.17
  Child age                                  −0.08     0.07    −0.03     [illegible]   0.09    −0.02
  Free/Reduced Price Lunch                   0.07      0.06    0.08      0.14†         0.07    0.17
  Initial achievement (math)                 0.00**    0.00    0.14      0.00**        0.00    0.24
  Self-efficacy                              0.02      0.04    0.01      0.10*         0.05    0.09
Teacher level (Models 1, 2A, 2B, 2C)
  Master’s degree (Model 1)                  0.06      0.06    0.10      −0.04         0.08    −0.03
  Years of experience (Model 1)              0.00      0.00    0.08      −0.00         0.00    −0.03
  Concurrent emotional support (2A)          0.02      0.03    0.02      0.04          0.06    0.05
  Concurrent classroom organization (2B)     0.13**    0.04    0.06      0.09          0.10    0.06
  Concurrent instructional support (2C)      −0.01     0.02    0.00      0.02          0.06    0.03

Note. Teacher-reported behavioral engagement was collected only once and thus the model does not include 
the observation level. Results of analyses addressing Research Question 3 showed no significant interactions and 
therefore the interaction term is not reported. 
† p < .10. * p < .05. ** p < .01. 


nization increased, boys reported higher levels of cognitive en- 
gagement, but there was no comparable association evident for 
girls. Likewise, findings showed a statistically significant interaction 
effect between classroom organization and gender for student-reported 
emotional engagement (b = −0.13, p < .01). As classroom organization 
increased, boys reported higher emotional engagement, but girls did not. 
Post hoc analyses were conducted for both interaction effects. The slope 
of classroom organization was statistically significant for boys 
(p < .01) but not girls (p > .05) for both outcomes. There was an 
interaction effect between instructional support and gender for 
student-reported social engagement (b = −0.06, p < .05). As 
instructional support increased, social engagement decreased for girls 
but not for boys. Post hoc analyses revealed a significant negative 
slope for girls (p < .01) but a nonsignificant slope (p > .05) for 
boys. (See Figure 1 for a description of these interactions.) 


Table 8 
Multilevel Model Results for Student-Reported Cognitive, Emotional, and Social Engagement 

                                            Cognitive                       Emotional                       Social
Measure                                     b             SE     B          b             SE     B          b             SE     B
Observation level (Model 1)
  Weeks of the year                         −0.01**       0.00   −0.02      −0.02**       0.00   −0.03      −0.01         0.00   −0.01
  Weeks of the year²                        0.00**        0.00   0.00       0.00**        0.00   0.00       −0.00*        0.00   −0.02
Student level
  Child gender (female)                     0.07**        0.03   0.17       0.09          0.05   0.24       0.14**        0.04   0.24
  Child age                                 −0.03         0.04   −0.02      −0.02         0.08   −0.01      0.04          0.06   0.03
  Free/Reduced Price Lunch                  0.06          0.04   0.14       [illegible]   0.07   0.05       0.04          0.06   0.06
  Initial achievement (math)                0.00          0.00   0.02       0.00          0.00   0.01       0.00          0.00   0.03
  Self-efficacy                             [illegible]   0.03   0.23       0.26**        0.05   0.24       0.30**        0.04   0.28
Teacher level (Models 1, 2A, 2B, 2C)
  Master’s degree (Model 1)                 0.07**        0.03   0.20       0.05          0.05   0.13       [illegible]   0.05   0.13
  Years of experience (Model 1)             −0.00         0.00   −0.06      −0.00         0.00   −0.06      −0.01         0.00   [illegible]
  Concurrent emotional support (2A)         0.03**        0.01   0.04       0.03**        0.02   0.03       0.03**        0.01   0.02
  Concurrent classroom organization (2B)    0.03**        0.01   0.03       0.03**        0.01   0.02       0.03**        0.01   0.02
  Concurrent instructional support (2C)     −0.01         0.02   −0.01      −0.00         0.01   −0.01      −0.01         0.01   −0.01
Interactions (Models 3A, 3B, 3C)
  Gender × Emotional Support (3A)           –             –      –          –             –      –          –             –      –
  Gender × Classroom Organization (3B)      −0.06**       0.02   −0.08      −0.13**       0.04   −0.09      –             –      –
  Gender × Instructional Support (3C)       –             –      –          –             –      –          −0.06**       0.02   −0.06

Note. Student-reported engagement measures were collected in the fall, winter, and spring. Children = 387, teachers = 63, schools = 20. Interactions 
that were not statistically significant were not included in the final models and are not shown. 
Figure 1. Interactions between teacher–student interaction quality (classroom organization and instructional 
support) and gender predicting student-reported engagement (cognitive, emotional, and social engagement). 
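
The gender-by-quality interactions reported above can be read as simple slopes. The following is a minimal sketch, assuming gender is dummy coded 0 = male and 1 = female (as in the note to Table 6), so that the coefficient on interaction quality is the slope for boys and the interaction term is the difference between the girls' and boys' slopes. The function name and the boys' slope of 0.06 are hypothetical; only the interaction value of −0.06 comes from the reported Gender × Classroom organization effect on cognitive engagement.

```python
def simple_slopes(b_quality: float, b_interaction: float) -> dict:
    """Return the slope of interaction quality for each gender.

    Assumes gender is coded 0 = male, 1 = female, so b_quality is the
    slope for boys and b_interaction is the girls-minus-boys difference.
    """
    return {
        "boys": b_quality,
        "girls": b_quality + b_interaction,
    }


# Hypothetical conditional slope for boys (not reported in the text);
# the interaction coefficient is the reported value of -0.06.
slopes = simple_slopes(b_quality=0.06, b_interaction=-0.06)

for group, slope in slopes.items():
    print(f"{group}: slope of classroom organization = {slope:+.2f}")
```

Under these illustrative values the slope is positive for boys and essentially zero for girls, which is the pattern the post hoc analyses describe for cognitive engagement.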


Changes in Engagement Over Time 


Although not the study’s focus, results showed that the linear 
and curvilinear trends of time (weeks in the year) were not signif- 
icant for observed behavioral engagement. Linear and curvilinear 
trends for time were present for all student-reported outcomes 
(cognitive, emotional, social engagement). For each student- 
reported outcome, the linear trend of time was significant (b from 
—.01 to —.02, p < .01) with a slight curvilinear relation (b < .01, 
p < .01), indicating that engagement decreased over time, but the 
decrease was steeper between Time 1 and Time 2 than between Time 2 
and Time 3. 
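
To see why a negative linear coefficient combined with a small positive quadratic coefficient implies a steeper early decline, consider the sketch below. The linear value of −0.02 lies within the reported range; the quadratic value (0.0003) and the week numbers for the fall, winter, and spring occasions are hypothetical, since the text reports only that the quadratic coefficient was below .01.

```python
def predicted_change(week: float, b_linear: float, b_quadratic: float) -> float:
    """Predicted change in engagement relative to week 0 under a quadratic trend."""
    return b_linear * week + b_quadratic * week ** 2


B_LINEAR = -0.02       # within the reported range of -.01 to -.02
B_QUADRATIC = 0.0003   # hypothetical; the text reports only b < .01
WEEKS = (0, 15, 30)    # hypothetical fall, winter, and spring occasions

levels = [predicted_change(w, B_LINEAR, B_QUADRATIC) for w in WEEKS]
drop_fall_to_winter = levels[0] - levels[1]
drop_winter_to_spring = levels[1] - levels[2]

print(f"Drop from fall to winter:   {drop_fall_to_winter:.4f}")
print(f"Drop from winter to spring: {drop_winter_to_spring:.4f}")
```

With these illustrative numbers the predicted drop is about 0.23 between the first two occasions and about 0.10 between the last two, so engagement keeps declining but at a slowing rate.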


Discussion 


Three main findings emerged. First, the fifth graders, on average, 
showed high levels of math engagement, regardless of informant. 
Correlations between informants were lower than anticipated given 
the simultaneity of the data collection, a finding that suggests the 
unique vantage point of each informant. Second, the most systematic 
finding from the multilevel models was the link from teacher—student 
interaction quality to student-reported engagement. That is, students 
in classrooms with teachers who show warmth, caring, and individual 
responsiveness to their students reported working hard, enjoying 

learning about math, and sharing ideas and materials with other 
students in their classroom. Similarly, students in classrooms with 
teachers who used proactive approaches to behavior management, 
facilitated smooth transitions between activities, and made learning 
objectives clear prior to learning also reported feeling greater cogni- 
tive, emotional, and social engagement in their math learning. Third, 
results showed higher engagement for girls than boys on three of the 
five engagement measures. Boys’ report of their cognitive and emotional 
engagement was more closely coupled to the classroom conditions 
(emotional and organizational support) than was girls’ report. An 
unexpected finding was that boys reported higher social engagement but 
girls reported lower social engagement in the presence of higher 
instructional support. 


Measurement Concordance and Discordance 


Correlation coefficients between different informants of student 
engagement were statistically significant, but small (1% to 11% 
shared variance). In contrast, associations within informants were 
high (24% to 49% of shared variance), even when comparing the 
same informant on different subconstructs of engagement. Findings 
match other literature showing modest cross-informant agreement 
(Gresham, Elliott, Cook, Vance, & Kettler, 2010; Konold & Pianta, 
2007; Renk & Phares, 2004). Comparisons of informants can be 
considered in light of the integrative framework of motivation (Skin- 
ner et al., 2009). In theory, contexts, self-systems, and action are 
conceptually distinct, but in practice, accurate measurement of action 
(engagement) is challenging because it is tinged by characteristics of 
students’ self-systems (goals, expectancies, perceived task value) and 
contexts (classroom interactions) depending on informant. The results 
provide researchers with new understanding as they consider mea- 
surement trade-offs. 
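
The shared-variance percentages quoted above follow from squaring the correlation coefficient. A minimal sketch, using a hypothetical helper name and the lowest cross-informant (r = .11) and within-informant (r = .49) correlations reported in the Results:

```python
def shared_variance_pct(r: float) -> float:
    """Percentage of variance two measures share, given their correlation r."""
    return 100 * r ** 2


# Lowest cross-informant and lowest within-informant correlations reported.
for r in (0.11, 0.49):
    print(f"r = {r:.2f} -> {shared_variance_pct(r):.1f}% shared variance")
```

These two values correspond to the lower bounds of roughly 1% and 24% cited in the paragraph above.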

We posit that low correlations among informants represent dispar- 
ities in perspectives on the classroom and cannot be dismissed as 
error. For example, correlations between the observer’s perception of 
behavioral engagement and students’ feelings of engagement were 
low (r = .24 to .26), although the data were collected simultaneously. 
Behavioral engagement can be observed reliably by a research assis- 
tant and therefore may provide a more objective standpoint for un- 
derstanding engagement. However, observed behavioral engagement 
may reflect superficial signs of engagement, whereas cognitive and 
emotional engagement assessed by student ratings may reflect 
intrapsychic processes that, in part, are influenced by students’ 
self-systems. This difference is important by the time students reach fifth 
grade because children have become accustomed to the student 
“script” and may show signs of behavioral engagement without gen- 
uine feelings of connection to learning. Researchers evaluating math 
curricula and interventions may benefit from gathering information 
about students’ perception of engagement. 

The majority of engagement research relies on teacher-reported 
behavioral engagement at the end of the year. Despite research linking 
teacher-reported behavioral engagement to achievement (e.g., Hughes 
& Kwok, 2007; Valiente, Lemery-Chalfant, Swanson, & Reiser, 
2008), the present findings suggest that researchers should be cautious 
about overreliance on teacher-reported data. The correlations between 
observed behavioral engagement at three time points and teacher- 
reported behavioral engagement at the end of the year were low (from 
.29 to .33). Teacher-reported engagement is multidetermined and 
reflects students’ actual engagement as well as teachers’ attributes and 
perceptions (Mashburn et al., 2006). Teachers’ rating tendencies may 
be systematic; for instance, teachers’ ratings of fifth graders show 
greater inflation of girls’ scores compared to boys (Robinson & 
Lubienski, 2011), and teachers appear to be better reporters of exter- 
nalizing than internalizing problems (Konold & Pianta, 2007). 


Contribution of Gender and Interaction 
Quality on Engagement 


Girls were more engaged than boys for three of the five mea- 
sured engagement constructs: observed behavioral engagement, 
student-reported cognitive engagement, and student-reported so- 
cial engagement. Gender differences on self-reported emotional 
engagement and teacher-reported behavioral engagement ap- 
proached statistical significance. Girls’ higher observed behavioral 
engagement is consistent with other research suggesting higher 
behavioral engagement among girls than boys in late elementary 
and middle school (Marks, 2000; Wang, Willett, & Eccles, 2011). 
By definition, behavioral engagement involves the absence of 
disruptive behavior (Finn, Pannozzo, & Voelkl, 1995; Wang et al., 
2011). The gender difference in observed behavioral engagement 
fits with other work describing more disruptive behavior in boys 
than girls (Finn et al., 1995). Girls reported higher cognitive 
engagement in math, comparable to findings in seventh graders 
(Wang et al., 2011). 

The presence of emotional support was linked to students’ 
own perception of their engagement (cognitive, emotional, and 
social). By definition, emotionally supportive teachers show 
warm and responsive behavior toward students and facilitate a 
classroom climate in which students exhibit positive, prosocial 
behavior (Pianta et al., 2008). Emotional support signals a sense 
of security to students that permits full attention to the academic 
work. It also fosters a classroom environment with positive 
communication and respect among peers (Luckner & Pianta, 
2011). Both factors may be important for fifth graders facing 
challenging math learning. Findings match research pointing to 
the importance of the affective qualities of school, positive 
classroom climate, and teacher—student relationship for promot- 
ing engagement and learning (Borman & Overman, 2004; 
Decker, Dona, & Christenson, 2007; Dotterer & Lowe, 2011; 
Roorda et al., 2011; Stronge, Ward, & Grant, 2011; Reyes et al., 


2012). The finding that teachers’ facilitation of a warm and 
supportive environment relates to students’ perceived engage- 
ment but not higher observed or teacher-reported engagement 
underscores the point that student-reported engagement taps 
intrapsychic processes. The result also emphasizes the impor- 
tance of emotionally supportive interactions between teachers 
and students in fifth grade, an issue that is crucial to convey to 
late elementary school math educators who are pressed for time 
or may perceive that relationship-building efforts are less es- 
sential for older students. 

Well-organized classrooms appear to support students’ en- 
gagement in math learning, as evidenced by higher observed 
behavioral engagement and student-reported cognitive and 
emotional engagement. The finding linking classroom organi- 
zation to observed behavioral engagement is not surprising; 
interesting learning formats, clear statement of expectations, 
and high productivity are teaching practices that have been linked to 
observed student behavioral engagement in classic (Brophy, 
1983) and recent work (Downer, Rimm-Kaufman, & Pianta, 
2007). However, the result that higher classroom organization 
relates to students’ perception of their cognitive engagement 
(commitment to paying attention, desire to understand compli- 
cated material) and emotional engagement (enjoyment of math 
class and problem solving) stands out as an important new contribution. 
By fifth grade, teachers may perceive students’ need for 
more autonomy. Effective practices attuned to fifth graders’ 
developmental needs involve fostering autonomy while main- 
taining clear objectives and minimizing classroom chaos 
(Eccles, 2004). 

Counter to expectation, results showed no main effects of 
instructional support on students’ engagement in learning. This 
is a surprising finding. One possible explanation stems from the 
reliance on the CLASS, a global measure of instructional sup- 
port that reflects teachers’ interactions with their whole class- 
room of students. Students within a single classroom show a 
wide range of abilities. Although teachers may be providing 
even amounts of concept development or high quality feedback 
to students across the classroom, the level of instruction may be 
too hard for some students, too easy for others, and just right for 
others. Another explanation pertains to the reliance on an 
observational measure of instructional support. Gathering in- 
formation on each student’s perception of instructional support 
from their teacher may increase accuracy. Future work is 
needed that considers students’ ability level relative to the level 
of the mathematical tasks and taps students’ perception of 
teachers’ instructional support. 

None of the three domains of teacher—student interaction quality 
related to teacher-reported behavioral engagement. Measuring 
teacher—student interactions and student engagement concurrently 
reveals associations between teacher and student behavior that 
otherwise may be masked in an end-of-the-year, teacher-reported 
measure. Teachers’ report of engagement may reflect teacher 
attributes as well as student engagement (Mashburn et al., 2006). 


Interactions Between Teacher-Student Interaction 
Quality and Gender 


Classroom organization was associated with students’ percep- 
tion of their engagement more for boys than girls. Boys may be 
more distracted by chaotic learning environments and thus show 
more difficulty engaging in learning (Ponitz, Rimm-Kaufman, 
Brock, & Nathanson, 2009). Girls may have better developed 
self-regulatory skills and require fewer external structures to sup- 
port their engagement. 

Instructional support findings were surprising. Girls in class- 
rooms with higher instructional support reported lower social 
engagement, whereas there was no relation between instructional 
support and social engagement in boys. The finding may reflect 
complementarity between teacher—student and student—student in- 
teractions. High levels of instructional support may be more evi- 
dent in classrooms where teachers engage in frequent, high quality 
interactions with students but facilitate fewer peer interactions. In 
fact, two of the three instructional support CLASS dimensions 
(quality of feedback, language modeling) involve verbal interac- 
tions between teachers and students. Frequent teacher—student 
interactions may supplant peer-to-peer interactions that would 
occur otherwise (Patrick et al., 2007). It is unclear why this 
association was present among girls but not boys. 


Limitations 

Several limitations require mention. First, we did not emphasize 
peer interactions. By fifth grade, students are increasingly aware of 
and influenced by peers. Peer liking and acceptance contribute to 
teacher—student interaction quality, peer nominations of academic 
competence, and students’ perception of their achievement self- 
efficacy (Hughes & Chen, 2011). Fifth graders may be sensitive to 
classroom composition. The engagement of a student’s peer group 
links to their engagement over the course of the year, net of other 
factors—a consideration not included in the present work (Kin- 
dermann, 2007). Second, data were gathered in the context of a 
descriptive study; therefore, the work does not support causal 
inferences. Future research using an experimental design is 
needed. Third, the three approaches to measuring engagement 
reflect different time sampling. Observational data were based on 
24 min spread across 3 days; student-reported data were based on 
students’ reflection on the full period of math class across those 
same 3 days. Teacher-reported measures were based on reflections 
of students over the course of the year. Fourth, the data collection 
did not include students’ reports of their teachers’ quality of 
interactions. Fifth, some of the measures have lower than ideal 
reliability. 


Closing Comments 


Recommendations for improving mathematics achievement 
hinge on teachers’ ability to engage students in learning in the 
classroom (CCSSI, 2014; NCTM, 2000; National Research Coun- 
cil, 2005), raising questions about the extent to which different 
types of teacher-student interactions contribute to enhanced 
engagement in the math classroom. On a daily basis, teachers rely on 
their perception of students to know whether to adjust the content 
and pace of learning to keep students engaged. However, by fifth 
grade students know how to appear interested and engaged, leav- 
ing teachers with questions about what they can do to be sure that 
students are putting forth their best effort and are truly curious and 
interested in the math. Findings lead to at least two implications. 
Teachers having difficulty gauging students’ interest, curiosity and 


attention toward math may want to rely less on their own insights 
or on an outside observer’s observations and instead generate 
strategies for receiving direct and honest feedback from their 
students. Despite the fact 
that fifth graders are not young children, the students, especially 
boys, appear to be well attuned to the warmth and responsiveness 
of their teacher and clarity of expectations in the classroom. 
“Michael Can’t Read!” Teachers’ Gender Stereotypes and Boys’ 
Reading Self-Concept 
According to expectancy-value theory, the gender stereotypes of significant others such as parents, peers, 
or teachers affect students’ competence beliefs, values, and achievement-related behavior. Stereotypi- 
cally, gender beliefs about reading favor girls. The aim of this study was to investigate whether teachers’ 
gender stereotypes in relation to reading—their belief that girls outperform boys—have a negative effect 
on the reading self-concept of boys, but not girls. We drew on a longitudinal study comprising two 
occasions of data collection: toward the beginning of Grade 5 (T1) and in the second half of Grade 6 (T2). 
Our sample consisted of 54 teachers and 1,358 students. Using multilevel modeling, controlling for T1 
reading self-concept, reading achievement, and school track, we found a negative association between 
teachers’ gender stereotype at T1 and boys’ reading self-concept at T2, as expected. For girls, this 
association did not yield a significant result. Thus, our results provide empirical support for the idea that 
gender differences in self-concept may be due to the stereotypical beliefs of teachers as significant others. 
In concluding, we discuss what teachers can do to counteract the effects of their own gender stereotypes. 


Keywords: gender stereotypes, reading self-concept 


Gender differences in students’ academic self-concepts often 
exceed differences in actual achievement (Hyde & Durik, 2005). 
Drawing on expectancy-value theory (e.g., Eccles & Wigfield, 
2002; Wigfield & Eccles, 2000), one compelling explanation of 
this discrepancy is that self-concepts develop, inter alia, as a 
function of the gender beliefs or stereotypes of significant others 
such as parents, peers, or teachers. Stereotypes are very powerful 
in shaping biased expectations of and behaviors toward groups, 
especially in regard to broad categories like gender (Schneider, 
2004). Such expectations and behaviors can in turn affect the 
self-concept of members of the stereotyped group. This is in line 
with the assumption of social identity theory (Tajfel & Turner, 
1986) that widely held stereotypes about social groups can influence 
a person’s view of her- or himself. People derive their identity 
in part from the social group they belong to and therefore from 
socially shared beliefs about their group’s characteristics (cf. 
Tajfel, 1981). For example, girls may develop a positive verbal 
self-concept due in part to their knowledge of the social belief that 
girls and women are good at language-related tasks. Regarding 
educational outcomes and gender, the question as to which group 
is negatively stereotyped depends on the domain (Plante, de la 
Sablonnière, Aronson, & Théorêt, 2013). Whereas there has been 
some research on the negative effects of stereotyping for girls in 
mathematics (see e.g., Nguyen & Ryan, 2008, for a review), little 
is known about the negative effects of stereotypes for boys in 
reading. In this longitudinal study, we aimed to investigate the 
relation of teachers’ gender stereotypes about reading as a stereo- 
typically female academic outcome (Schmenk, 2004) to students’ 
self-concept in reading. There has as yet not been much research 
testing the assumption of expectancy-value theory that teachers’ 
gender stereotypes may explain gender differences in students’ 
reading self-concept. 


Gender Differences in the Development of Language- 
Related Self-Concepts 


Gender is believed to play an important role in shaping 
students’ ability self-concepts (Eccles & Wigfield, 2002; 
Meece, Bower Glienke, & Burg, 2006; Wigfield & Eccles, 
2000). Since ability self-concepts are highly domain specific 
(Marsh, Trautwein, Lüdtke, Köller, & Baumert, 2006; Möller, 
Retelsdorf, Köller, & Marsh, 2011), the question as to which 
gender is advantaged and which disadvantaged depends, of 
course, on the particular domain. Typically, ability self-concept 
is higher for the gender that is stereotypically favored in a 
particular domain (Watt & Eccles, 2008). Thus, boys are be- 
lieved to have higher mathematics and related self-concepts, 
and girls to have higher language-related self-concepts. Indeed, 
there is compelling evidence that girls report higher confidence 
in their language abilities than do boys (Durik, Vida, & Eccles, 
2006; Eccles, Wigfield, Harold, & Blumenfeld, 1993; Ireson & 
Hallam, 2009; Wigfield et al., 1997) although not all studies 
have found such differences (Anderman et al., 2001; Evans, 
Copping, Rowley, & Kurtz-Costes, 2011; Skaalvik & Skaalvik, 
2004). Moreover, there is even some evidence from longitudi- 
nal studies that these gender differences increase over time. 
Jacobs, Lanza, Osgood, Eccles, and Wigfield (2002) reported 
such a widening gap between girls’ and boys’ language-related 
self-concept from Grades 1 to 12. In another longitudinal study, 
Archambault, Eccles, and Vida (2010) identified seven groups 
of children with distinct trajectories of language-related self- 
concept. They found a higher proportion of girls maintained the 
highest and most stable self-concepts over time; conversely, a 
higher proportion of boys indicated substantial self-concept 
decline. These results also indicated an increasing gender dif- 
ference over time. It is also noteworthy, however, that self- 
concepts decline for both boys and girls over time. Thus, the 
widening gender gap would appear to be a result of the steeper 
decline within the group of boys. 

A promising approach to the explanation of gender differ- 
ences in self-concept is provided by Eccles’ expectancy-value 
theory of achievement-related choices (e.g., Eccles et al., 1983; 
Eccles & Wigfield, 2002; Wigfield & Eccles, 2000). This 
theory, which provides a general model for the explanation of 
achievement-related choices and behaviors, has a particular 
focus on the understanding of gender differences. The model 
deals with the question of the circumstances under which a 
person will undertake a challenging achievement task. This is 
explained in terms of high value of the task and high expecta- 
tion of success. Moreover, the model also provides a valuable 
framework for the explanation of gender differences in ability 
self-concepts that are closely related to one core variable of the 
model, namely expectation of success. According to expectancy-value 
theory, a person’s self-concept is shaped not only by his or her 
previous achievement but also by a variety of social and cultural 
factors. These factors comprise cultural gender roles that pre- 
scribe certain behaviors as appropriate or inappropriate for 
males or females, as well as gender stereotypes. Moreover, the 
behaviors and beliefs of significant others, such as peers, par- 
ents, and teachers, play an important role in shaping students’ 
self-concepts. In the present research, we focused on the role of 
teachers, as there is some evidence that teachers can contribute 
to the gender gap. For example, they may pay more attention to 
boys than to girls (DeZolt & Hull, 2001) and communicate 
overall more with boys than with girls—in particular, approving 
boys’ academic behavior and disapproving their social behavior 
more frequently (Swinson & Harrop, 2009). However, there has 
been little research directly connecting teachers’ gender beliefs 
with student outcomes. In the present research, we addressed 
this lacuna by investigating the effect of teachers’ gender ste- 
reotypes about reading on students’ self-concepts. 


Gender Stereotypes in Education 


Stereotypes can be broadly defined as “shared beliefs about 
personality traits and behaviors of group members” (Fiedler & 
Bless, 2001, p. 123). Stereotyping results from categorizing indi- 
viduals into groups, according to their presumed common attri- 
butes. While stereotypes can function as cognitive schemas to 
facilitate social interactions with unknown individuals, as over- 
generalizations of traits for a group in general, they also shape 
expectations and behaviors. Consensually shared stereotypes 
within a culture can serve as social norms for behavior toward the 
stereotyped group (e.g., Asbrock, Nieuwoudt, Duckitt, & Sibley, 
2011; Cuddy, Fiske, & Glick, 2007). In respect to gender, the two 
groups, males and females, are presumed to differ in their traits, 
abilities, and motivation (cf. Schmenk, 2004). The latter two are of 
particular interest in education while, as mentioned earlier, stereo- 
types depend on the particular domain that is being considered. 
Research investigating gender stereotypes in the educational con- 
text has mainly focused on stereotype threat—a phenomenon 
describing how stereotypes can become self-fulfilling in a partic- 
ular situation (Aronson & Steele, 2005; Steele, 1997). Stereotype 
threat refers to a situational threat arising from a negative stereotype about 
one’s own group (Steele, 1997). In the educational context, ste- 
reotyped persons feel extra pressure not to fail in a situation where 
academic competence is relevant. Regarding gender, there is quite 
strong evidence—mainly from experimental research—for the 
negative impact of stereotype threat on the performance of girls or 
women in mathematics tests (see Nguyen & Ryan, 2008, for a 
review). Moreover, in a recent article, Hartley and Sutton (2013) 
investigated the role of stereotype threat in boys’ general academic 
underachievement. In one study, they showed that girls and boys 
believed that girls academically outperform boys and also thought 
that adults believed this to be true. In a second study, they manip- 
ulated stereotype threat by telling the children in their sample that 
boys tend to perform lower than girls at school. This manipulation 
negatively affected the boys’ performance in reading, writing, and 
mathematics but had no effect on girls’ performance. 

Moreover, Plante et al. (2013) have investigated students’ own 
gender stereotypes and their associations with self-concept, task 
values, and achievement in a naturalistic setting. They tested the 
hypothesis from expectancy-value theory (e.g., Eccles & Wigfield, 
2002) that the relationship between gender stereotypes and aca- 
demic outcomes is mediated by students’ self-concepts and task 
values in the corresponding domain. In their cross-sectional study, 
they found that effects of gender stereotypes on achievement in 
mathematics and language arts were mediated by students’ self- 
concepts and task values. However, Plante et al. (2013) only 
investigated the students’ own stereotypes. Thus, the idea of 
expectancy-value theory, that the gender beliefs of significant 
others affect students’ self-concept development, could not be 
tested. Generally, this is an under-researched issue. Even though 
stereotypes have been a “hot topic” in general, as Jussim, Eccles, 
and Madon (1996) noted, only a few studies have investigated 
the effects of stereotypes of significant others in more naturalistic 
settings. To the best of our knowledge, there has not been much 
development in the research since this conclusion was drawn. One 
notable exception is research showing that parents’ stereotypic 
beliefs affect children’s perceptions of their ability. For example, 
Jacobs and Eccles (1992) have shown that across three domains 
(mathematics, sports, and the social domain), mothers’ gender stereotypes 
either lead to an overestimated perception of their child’s 
ability, if the child is stereotypically favored, or to an underesti- 
mation of their child’s ability, if the child is stereotypically disad- 
vantaged. In turn, the mothers’ perceptions of their child’s ability 
affect the children’s own perception of their ability. Similarly, 
Tiedemann (2000) found that mothers’ and fathers’ gender stereotypes 
predicted their beliefs about their child’s abilities, which in 
turn were related to their child’s self-perceptions of ability. More 
recently, Rouland, Rowley, and Kurtz-Costes (2013) found that 
parents’ gender stereotypes were related to their attributions for 
their children’s academic successes and failures that in turn were 
related to the children’s own self-beliefs. 

However, less is known about the effects of stereotypes in other 
groups of significant others. In the educational context, of course, 
one of the most important groups is that of teachers, because they 
interact with children on a daily basis, instruct them, judge them, 
and—as a consequence—develop evaluations of the children’s 
cognitive and social development and long-term career prospects. 
Indeed, there has been a vast amount of research on the related 
issue of teacher expectations for low- and high-achieving students 
(for a review, see Jussim & Harber, 2005). While there has been 
some research on student gender as a potential moderator of 
teacher expectation effects (e.g., de Boer, Bosker, & van der Werf, 
2010), there are fewer studies on teachers’ explicit beliefs about 
boys’ and girls’ different domain-specific abilities. Such beliefs, 
however, may have significant consequences for students’ outcomes. 
Teachers acting upon gender stereotypes could, consciously or 
unconsciously, shape social interactions in class by, 
for example, creating a warm and challenging atmosphere for 
students from positively stereotyped groups and a cold and less 
challenging environment for students from negatively stereotyped 
groups (Aronson & Steele, 2005). Moreover, in one of the few 
studies on teachers’ explicit gender stereotypes, Tiedemann (2002) 
found that stereotypes are related to the teachers’ beliefs about 
effort and ability in mathematics. 

In research into the effects of teachers’ gender stereotypes, one 
should be aware of the particular age of the students participating 
in the investigation. There is some research showing that with 
increasing age, children become more and more aware of widely 
held stereotypes (Martinot, Bagés, & Désert, 2012; McKown & 
Weinstein, 2003) and are more likely to endorse traditional ste- 
reotypes themselves (Rowley, Kurtz-Costes, Mistry, & Feagans, 
2007). Even more important in relation to the present research is 
that students in late childhood or early adolescence become more 
and more aware of other persons’ stereotypes. In a study by 
Kurtz-Costes, Rowley, Harris-Britt, and Woods (2008), for exam- 
ple, middle school children seemed to be more aware of adult 
stereotypes than were elementary school children. Thus, even 
though teachers may, of course, shape students’ self-concepts at a 
young age, the focus of the present research on investigating the 
effects of teacher stereotypes in late childhood seemed appropriate. 


The Present Investigation 


Drawing on the idea that the gender-related beliefs and actions 
of significant others such as peers, parents, and teachers may affect 
the development of students’ academic self-concept (e.g., Eccles & 
Wigfield, 2002; Wigfield & Eccles, 2000), we aimed to investigate 
the relation of teachers’ gender stereotypes to students’ reading 
self-concept. We followed a longitudinal design with two waves of 
data collection. Our study went beyond previous research, as we 
investigated the consequences of teachers’ explicit gender beliefs 
for the development of reading self-concept as a relatively stable 
personal characteristic. We analyzed the effect of teachers’ stereo- 
types on reading self-concept over and above previous reading 
achievement. This is important, because it is obvious that prior 
academic achievement is influential in the formation of subsequent 
self-concept (Shavelson, Hubner, & Stanton, 1976); this is also 
true in the domain of reading (Retelsdorf, Köller, & Möller, 2014). 
Moreover, to account for the possible influence of ability grouping 
on students’ self-concept (e.g., Marsh et al., 2008), we included the 
aggregated between-level achievement and reading self-concept as 
well as school track in our data analysis. In Germany, after 
elementary school, students are assigned to different types of 
school; these aim to prepare students either for a vocational ap- 
prenticeship (nonacademic track schools) or for university en- 
trance (academic track schools). 

Since gender beliefs about reading stereotypically favor girls 
(Plante et al., 2013; Schmenk, 2004), we expected that negative 
gender stereotypes about boys’ reading abilities would lower boys’ 
reading self-concept. For girls, however, the expectations were less 
clear. On the one hand, there is evidence that even positive 
stereotypes can have negative effects, because high expectations 
may lead to so-called choking under pressure, which results in 
lower performance (Cheryan & Bodenhausen, 2000). On the other 
hand, girls’ reading self-concepts are quite positive (Archambault 
et al., 2010), and the effects of stereotypes are generally expected 
to be rather small, so that a significant effect of teachers’ gender 
stereotype on girls’ reading self-concept was not expected. 


Method 


Sample and Procedure 


Our sample stemmed from the larger longitudinal project LISA 
(in German: Lesen in der Sekundarstufe [Reading in secondary 
school]), which mainly deals with the individual and contextual 
determinants of reading comprehension (e.g., Retelsdorf, Becker, 
Köller, & Möller, 2012; Retelsdorf, Köller, & Möller, 2011). This 
study drew on a sample of 1,508 secondary school students from 
60 classes, drawn to be representative of the federal state of 
Schleswig-Holstein, Germany. Data collection was performed by 
trained research students and took place as group tests carried out 
in class during regular lessons. The student questionnaire including 
the reading self-concept measure and the reading achievement 
tests was administered toward the beginning of Grade 5, a few 
weeks after the beginning of the school year (T1) and again after 
an interval of approximately 18 months, in the second half of 
Grade 6 (T2). Moreover, within 14 days of the data collection 
among the students at T1, all 60 German language teachers were 
asked to complete a teacher questionnaire including the items 
measuring their gender stereotypes; 54 teachers answered (66% 
female). Note that it is established practice in secondary school 
that teachers usually change only every 2 years. In this study, only 
those students were included for whom teacher data also were 
available; this reduced the sample to 1,358 students (49% girls; 
girls’ age at T1: M = 10.96, SD = 0.61; boys’ age at T1: M = 
10.82, SD = 0.51; 36% at academic track schools). Applying t 
tests for reading achievement and reading self-concept and chi- 
square-tests for students’ gender, we tested whether the excluded 
students differed in the study variables from the included students. 
None of these tests yielded significant results (p ≥ .135). 


Measures 


Reading self-concept. We assessed reading self-concept with 
a subscale from the Habitual Reading Motivation Questionnaire 
(Möller & Bonerad, 2007) that comprises four items measuring 
students’ evaluations of their own reading skills. Thus, the self- 
concept items refer to the comprehension of texts rather than to 
more basic reading skills (e.g., “Generally, understanding texts is 
easy for me”). Students rated their agreement with each item on a 
4-point Likert-type scale anchored at 1 (does not apply to me) and 
4 (applies to me). Cronbach’s alpha values were sufficient at 
both waves of data collection (α_T1 = .74, α_T2 = .75). 
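
As a minimal sketch of how an internal-consistency coefficient of this kind is obtained, the following Python function computes Cronbach’s alpha from a respondent-by-item score matrix. The function name and the toy ratings are hypothetical illustrations; they are not the study’s data or the authors’ code.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(item_scores[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in item_scores]) for i in range(n_items)]
    # Sample variance of the respondents' sum scores
    total_var = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings of five students on four items (1-4 response scale)
ratings = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(ratings)
```

Alpha rises when items covary strongly relative to their individual variances, which is why a short four-item scale can still reach values in the .70s.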

Teachers’ gender stereotypes. Teachers at T1 were asked to 
answer three questions measuring their gender stereotypes about 
reading. They were asked if boys or girls read better, read more, 
and have more fun reading. Each answer was rated on a 5-point 
Likert-type scale, anchored at 1 (boys much better/more) and 5 
(girls much better/more). The reliability of the scale was good 
(α = .87). 

Reading achievement. In this study, we used reading com- 
prehension tests from the German section of the Progress in 
International Reading Literacy Study (Bos et al., 2005). The stu- 
dents’ task was to read several texts and answer questions on their 
content. The questions mainly focused on students’ skills in form- 
ing a broad and general understanding of the texts and in retrieving 
information from the texts. The test comprised 27 items—mainly 
multiple-choice items with four possible answers, but some open- 
format questions also were included. The item parameters were 
estimated by applying the partial credit model, because some items 
were scored polytomously. We estimated weighted-likelihood es- 
timates (WLE) as subjects’ ability scores using ConQuest (Wu, 
Adams, & Wilson, 1998). The WLE-reliability of the reading tests 
was sufficient (.82). 


Statistical Analyses 


We analyzed the association between teachers’ gender stereo- 
types and students’ self-concept by means of multiple group mul- 
tilevel modeling, using Mplus Version 7.1 (Muthén & Muthén, 
2013). Every teacher taught exactly one class in our sample, so 
that between-teacher and between-class effects are the same. Read- 
ing self-concept at T1, reading achievement, and teachers’ gender 
stereotype were standardized (M = 0, SD = 1). Reading self-concept 
at T2 was standardized at the T1 mean and standard 
deviation of reading self-concept. To test our assumption that 
teachers’ gender stereotypes affect boys’ but not girls’ self-concept, 
we specified a multiple group model with gender as a 
grouping variable. Since every teacher, however, teaches boys and 
girls, we had to deal with the situation that the grouping variable 
was within level. Thus, within each cluster, there could be varying 
random effects for boys and girls that could not be directly spec- 
ified as multiple group multilevel models. Asparouhov and 
Muthén (2012) have suggested introducing latent variables that 
represent this variation in between-level random effects. This 
approach also allows proper accounting for the covariance be- 
tween the two group specific cluster effects. We tested a series of 
models predicting reading self-concept at T2 using this approach. 
In the first model, we included reading self-concept at T1 and 
teachers’ gender stereotypes as predictors. In the second model, we 
additionally controlled for students’ reading achievement at T1. 
Third, we additionally included aggregated scores of reading self-concept 
and reading achievement at T1 and school track as 
between-level covariates. The aggregated data were not standard- 
ized again at between-level. 
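
The scaling step described above, in which T2 scores are expressed in T1 units, can be sketched as follows. This is an illustration under stated assumptions: the function name and the score lists are hypothetical, and the use of the population standard deviation is our choice, because the text does not say which variant was used.

```python
from statistics import mean, pstdev

def standardize_at_reference(values, reference):
    """z-score `values` using the mean and SD of `reference`."""
    ref_mean = mean(reference)
    ref_sd = pstdev(reference)  # assumption: population SD of the reference wave
    return [(v - ref_mean) / ref_sd for v in values]

t1 = [2.0, 3.0, 3.0, 4.0]  # hypothetical T1 scale scores
t2 = [2.5, 3.0, 3.5, 4.0]  # hypothetical T2 scale scores of the same students

z_t1 = standardize_at_reference(t1, t1)  # mean 0, SD 1 by construction
z_t2 = standardize_at_reference(t2, t1)  # T2 expressed in T1 standard deviations
```

Because T2 is scaled with the T1 mean and standard deviation rather than its own, a nonzero mean of `z_t2` reads directly as change relative to T1.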

We evaluated effect sizes to facilitate the interpretation of our 
results, following Tymms’s (2004) proposal for calculating effect 
sizes in multilevel models. The effect size Δ can be interpreted 
similarly to Cohen’s d (Cohen, 1988) and is calculated using the 
unstandardized regression coefficient in the multilevel model, the 
standard deviation of the predictor variable at between level, and 
the residual standard deviation at within level. 
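
As a hedged illustration of this calculation, the sketch below uses the form Δ = 2 × B × SD(predictor) / σ(e), which is our reading of Tymms’s proposal for continuous predictors; the text itself names only the three ingredients. The function name is hypothetical, and the residual standard deviation of 0.74 is an assumed value, because it is not reported here.

```python
def tymms_delta(b, sd_predictor, residual_sd_within):
    """Effect size for a continuous predictor: 2 * B * SD_predictor / sigma_e."""
    return 2.0 * b * sd_predictor / residual_sd_within

# B = -.103 is the Model 1 coefficient for boys reported in Table 2.
# The predictor was standardized, so SD = 1 is assumed at the between level.
# The residual SD of 0.74 is an ASSUMED value chosen for illustration only.
delta_boys = tymms_delta(b=-0.103, sd_predictor=1.0, residual_sd_within=0.74)
```

Under these assumptions the result rounds to −.28, in line with the effect size reported for boys in Model 1; a different residual standard deviation would shift the value accordingly.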

Because some data were missing, we used multiply imputed data in all 
analyses, a state-of-the-art approach to this problem (cf. 
Graham, 2009). On average, 11% of the data per variable were 
missing. We applied multiple imputation to create 20 complete 
data sets using Mplus 7.1 (see Graham, Olchowski, & Gilreath, 
2007, for a discussion on the sufficient number of imputations). 
All subsequent analyses were then conducted 20 times, and the 
results were combined automatically in Mplus. 
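
Combining results across imputed data sets is conventionally done with Rubin’s rules, and the sketch below assumes that convention. It pools a single coefficient and its standard error; the function name and the three example estimates are hypothetical and are not output from the study.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, standard_errors):
    """Pool one parameter across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Average sampling variance within imputations
    within_var = mean([se ** 2 for se in standard_errors])
    # Variance of the point estimates between imputations
    between_var = variance(estimates)
    total_var = within_var + (1 + 1 / m) * between_var
    return pooled_estimate, sqrt(total_var)

# Hypothetical coefficients and standard errors from three imputed data sets
est, se = pool_rubin([-0.10, -0.11, -0.09], [0.03, 0.03, 0.03])
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the final tests.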


Results 


Descriptive Statistics 


As presented in Table 1, students’ reading self-concept was 
above the theoretical mean of 2.5 at T1 and T2, indicating that 
students were quite confident in their reading skills. Moreover, 
boys had a higher reading self-concept than girls at T1, whereas 
girls had a higher reading self-concept than boys at T2. However, 
none of these differences reached significance in a Wald chi-square 
test: χ²(1) = 3.714, p = .054. Girls also attained higher reading 
achievement scores at T1. Finally, the relatively high score of 
teachers’ gender stereotypes indicated that, on average, the teach- 
ers believed that girls had higher reading abilities than boys. 


Multilevel Analyses 


Table 1 
Means and Standard Deviations of the Study Variables 

                                Overall         Girls           Boys
Variable                        M       SD      M       SD      M       SD
Reading self-concept T1         3.03    [?]     2.99    [?]     3.08    0.68
Reading self-concept T2         3.08    0.63    3.10    0.61    3.06    0.65
Reading achievement T1          −0.05   1.12    [?]     [?]     [?]     [?]
Teachers’ gender stereotype     3.91    0.60

Note. Weighted likelihood estimates have been estimated as subjects’ 
ability scores for reading achievement. N_teachers = 54, N_students = 1,358. 
T1 = Time 1; T2 = Time 2. 

We estimated the intraclass correlation (ICC) for reading 
self-concept at T2, that is, the proportion of total variance that 
could be attributed to between-class differences, resulting in an 
ICC of .114. Thus, more than 10% of the variance in reading 
self-concept, a substantial amount, was attributable to differences 
between classes. 
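
For readers less familiar with the intraclass correlation, the following sketch shows the ratio that defines it. The function name is hypothetical, and the two variance components are made-up values chosen only so that the result matches the reported ICC of .114; the actual components are not given in the text.

```python
def intraclass_correlation(var_between, var_within):
    """Share of total variance located between clusters (here: classes)."""
    return var_between / (var_between + var_within)

# Hypothetical variance components that reproduce an ICC of .114
icc = intraclass_correlation(var_between=0.114, var_within=0.886)
```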

The results of our multiple group multilevel analyses are pre- 
sented in Table 2. First, we tested a model (Model 1) in which 
reading self-concept at T1 was included as a within-level predictor 
and teachers’ gender stereotype as a between-level predictor of 
reading self-concept at T2. For boys and girls, reading self-concept 
proved to be a significant predictor, thus indicating a certain 
stability of reading self-concept. Moreover, as expected, a signif- 
icant negative effect of teachers’ gender stereotypes on students’ 
reading self-concept was recorded for boys but not for girls (effect 
sizes: Δ_boys = −.28, Δ_girls = −.01). The difference between boys 
and girls was tested by applying a Wald chi-square test, which 
indicated that the association between teachers’ gender stereotype 
and reading self-concept was significantly stronger for boys than 
for girls, χ²(1) = 11.05, p < .001. In Model 2, we additionally 
included reading achievement at T1 as a within-level predictor; 
this also proved to be a significant predictor of reading self- 
concept at T2. The effect of reading self-concept at T1 was still 
significant but slightly smaller than in Model 1. Moreover, the 
negative effect of teachers’ gender stereotype was again recorded 
for boys but not for girls (effect sizes: Δ_boys = −.25, Δ_girls = −.03). 
Again, this difference was significant, χ²(1) = 6.10, p < .05. 
Finally, we tested a third model (Model 3), in which we addition- 
ally included aggregate scores of reading achievement at T1 and 
reading self-concept at T1 and school track as between-level 
covariates. None of these additional variables yielded significance. 
Moreover, the effects of the within-level predictors and teachers’ 
gender stereotype were similar to those in Model 2 (effect sizes: 
Δ_boys = −.23, Δ_girls = −.04). The Wald chi-square test comparing 
the effect of teachers’ gender stereotype between boys and girls 
was again significant, χ²(1) = 3.94, p < .05. 
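
The Wald chi-square tests reported above each have one degree of freedom, so their p values can be checked with a short calculation. The sketch below is an illustration, not the authors’ procedure: both function names are hypothetical, and the difference test takes the covariance between the two coefficients as an input because that covariance is not reported and the statistic therefore cannot be reproduced from Table 2 alone.

```python
from math import erfc, sqrt

def chi2_1df_p_value(statistic):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))

def wald_difference(b1, se1, b2, se2, cov=0.0):
    """Wald chi-square (1 df) for H0: b1 == b2, given SEs and their covariance."""
    return (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2 - 2.0 * cov)

# p values for the three Wald statistics reported for Models 1 to 3
p_model1 = chi2_1df_p_value(11.05)
p_model2 = chi2_1df_p_value(6.10)
p_model3 = chi2_1df_p_value(3.94)
```

With `cov=0.0` the difference test treats the two group coefficients as independent; a positive covariance between the boys’ and girls’ cluster effects makes the statistic larger.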

To illustrate the differential associations between teachers’ gender 
stereotypes and students’ reading self-concept, we plotted 
simple slopes for the results of Model 2 for boys and for girls 
(Figure 1). We chose this model because the additional predictors 
in Model 3 did not contribute to the prediction of reading self-concept 
at T2. Therefore, we consider Model 2 to be the most 
relevant model; it also satisfies the criterion of parsimony. The simple 
slope analysis for Model 3, however, resulted in a similar pattern. 
Stronger gender stereotypes (that is, teachers’ stronger belief that 
girls outperform boys in reading) were associated with lower 
reading self-concept among boys, whereas girls’ reading self-concept 
was unaffected by teachers’ stereotype. 

[Figure 1: x-axis: teachers’ gender stereotype; y-axis: students’ reading self-concept T2; separate lines for boys and girls] 

Figure 1. Relation of teachers’ gender stereotype to boys’ and 
girls’ reading self-concept at T2 (from Model 2 in Table 2; all variables 
have been standardized). T2 = Time 2. 

As an exploratory analysis, we also tested whether teachers’ 
gender or the interaction Teachers’ Gender × Teachers’ Gender 
Stereotype had different effects on boys’ and girls’ reading 
self-concept at T2. In line with the assumption of the so-called 
same-sex teacher advantage (for a detailed discussion, see 
Neugebauer, Helbig, & Landmann, 2011), one might have 
expected that boys’ reading self-concept would benefit from a 
male teacher and girls’ reading self-concept might benefit from 
a female teacher. Moreover, these benefits could be due to 
different gender stereotypes, depending on the teachers’ gender. 
However, neither teachers’ gender nor the interaction term was 
a significant predictor of boys’ or girls’ reading self-concepts 
(p ≥ .154). Moreover, the results of the model, including 
teachers’ gender and their interaction, were by and large the 
same as the results of Model 3. 


Table 2 
Results of the Multiple Group Multilevel Analyses Predicting Reading Self-Concept at T2 

                                  Model 1                         Model 2                         Model 3
                                  Girls           Boys            Girls           Boys            Girls           Boys
Variable                          B       SE      B       SE      B       SE      B       SE      B       SE      B       SE
Within level
  Reading self-concept T1         .456*   .030    .474*   .035    .387*   .032    .389*   .033    .389*   .033    .378*   .035
  Reading achievement T1                                          .243*   .030    .273*   .037    .257*   .037    .254*   .046
Between level
  Teachers’ gender stereotype T1  −.003   .026    −.103*  .035    [?]     .024    −.090*  .033    −.013   .024    −.082*  .034
  Reading self-concept T1                                                                         .044    .156    .017    .149
  Reading achievement T1                                                                          .093    .095    −.006   .106
  School track                                                                                    −.019   .088    [?]     .096

Note. All variables but the dummies have been standardized (reading self-concept T2 was standardized at the mean and standard deviation of reading 
self-concept T1); school track was dummy-coded (0 = nonacademic track, 1 = academic track). N_teachers = 54, N_students = 1,358. 
T1 = Time 1; T2 = Time 2. 
*p < .05. 


Discussion 


The aim of this research was to investigate whether teachers’ 
stereotypes affected students’ self-concepts in reading, a stereo- 
typically female domain. We expected teachers’ gender stereo- 
types about students’ reading abilities—namely, that girls perform 
better in reading tasks—to negatively affect boys’ but not girls’ 
reading self-concepts. Therefore, we drew on longitudinal data 
comprising two waves of data collection to predict students’ read- 
ing self-concept at the end of Grade 6 with the previously reported 
(at the beginning of Grade 5) teacher stereotypes, controlling for 
previous reading self-concept. Our hypothesis was corroborated: 
boys’ reading self-concept in Grade 6 was lower for students 
whose teachers reported high scores for gender stereotypes. No 
effect was recorded for girls. Moreover, the effect was also robust 
when students’ previous achievement at the individual and class 
levels and their school track were included. Thus, our results have 
shown that teachers’ gender stereotypes negatively affect boys’ 
reading self-concept over and above their actual performance. 
Additionally, our results indicate that, on average, teachers’ reading 
stereotypes favor girls. Consequently, it is possible that even 
less stereotyped teachers favor girls over boys in the reading 
domain, indicating that the total effect of gender stereotypes might 
be greater than we can show in our analyses. However, this 
interpretation is rather speculative and needs further research. 

Before discussing the implications of our findings in more 
detail, we first discuss the absence of gender differences in the 
mean level of reading self-concept. This finding is in line with 
other research that does not support the assumption of gender 
differences in language-related self-concepts (Anderman et al., 
2001; Evans et al., 2011; Skaalvik & Skaalvik, 2004). One possi- 
ble explanation deals with the particular age of our students. 
Conjecturally, at the onset of puberty, the intensification of gender 
differences is only beginning. Although there is no evidence for 
such gender intensification in longitudinal studies (Jacobs et al., 
2002), our results do suggest the tendency for an opposing trend in 
girls’ and boys’ reading self-concept, favoring girls. Another ex- 
planation, provided by Skaalvik and Skaalvik (2004), deals with 
the idea that gender differences in self-concepts are based on 
perceptions of individual strengths and weaknesses across different 
domains, similar to what is proposed in dimensional comparison 
theory (Möller & Marsh, 2013). Thus, gender differences 
would not become obvious in group comparisons within a single 
domain but only in comparisons of self-concepts in different 
domains. Our data, however, do not allow for analyses along these 
lines. Regardless of this question of group differences, our results 
nevertheless provide some evidence that variability in reading 
self-concept development may be explained in part by teachers’ 
gender stereotypes. In the remainder of this article, we discuss the 
implications of these findings. 


Theoretical and Practical Implications 


Our findings advance our understanding of the development of 
reading self-concept in secondary school and contribute to our 
knowledge of possible reasons for gender differences in self- 
concept. However, even though our study comprised longitudinal 
data, and we were able to control for important predictors of 
reading self-concept, we cannot draw causal conclusions, since we 
cannot rule out the effects of unobserved variables. Our study, 
however, complements experimental data on the consequences of 
specific stereotype content (e.g., Becker & Asbrock, 2012; Cuddy 
et al., 2007) by providing high external validity, due to the natu- 
ralistic setting in actual school life. The results support the as- 
sumption of expectancy-value theory, that gender beliefs of sig- 
nificant others play an important role in shaping students’ ability 
self-concepts. We found evidence that, in addition to parents’ 
(Jacobs & Eccles, 1992; Tiedemann, 2000) and students’ own 
(Plante et al., 2013) stereotypes, teachers’ gender stereotypes also 
play an important role in shaping students’ self-concepts—over 
and above students’ actual achievement. Consequently, these ste- 
reotypes might explain to some extent why gender differences in 
language-related self-concept increase over time (Archambault et 
al., 2010; Jacobs et al., 2002). However, these effects were rather 
small in terms of Cohen’s (1988) classification of effect sizes. At 
least in children 10 years and older, however, reading self-concept 
seems to be quite stable (Retelsdorf et al., 2014), so that large 
effects cannot be expected, and thus, even small effects may still 
be of practical relevance. This might be even more relevant when 
taking into account that teachers’ stereotypes are a rather distal 
determinant of students’ self-concept compared with their achieve- 
ment or other student-level variables. It might be interesting, 
though, to test the relations of teachers’ gender stereotypes with 
younger students’ self-concept development. Jacobs et al. (2002) 
reported much greater decreases of language self-concept from 
Grade 1 to Grade 5 than from Grade 6 to Grade 12—particularly 
for boys. Thus, there might be sensitive developmental stages in 
which environmental influences have particularly pronounced effects 
on children’s self-concept; however, younger students may 
not yet be aware of teachers’ stereotypes (e.g., Muzzatti & Agnoli, 
2007). Moreover, there may be greater cumulative effects over a 
longer period of time. 

Another open question deals with the mediating processes. We 
were not able to investigate such processes between teachers’ 
gender stereotypes and students’ reading self-concept. Thus, we do 
not know whether teachers who think that boys are less able to 
read than girls actually treat boys and girls differently. However, it 
seems plausible that teachers’ beliefs would influence their own 
behavior in the classroom, as indicated by experimental studies on the 
effects of specific stereotypes on outgroup-directed behavior (e.g., 
Becker & Asbrock, 2012; Cuddy et al., 2007; for an overview, see 
Cuddy, Glick, & Beninger, 2011). As a consequence, boys’ read- 
ing self-concept might suffer from teachers’ behavior even when 
their reading abilities are similar to girls’ abilities. As Rubie- 
Davies, Hattie, and Hamilton (2006) discussed, there is some 
evidence that teachers who hold stereotypes regarding particular 
groups alter their practices and limit opportunities to learn for 
negatively stereotyped students. This is in line with research on the 
effects of incompetence stereotypes: Groups perceived as incom- 
petent are ignored or excluded more than other groups (Cuddy et 
al., 2011). Moreover, teachers believing in a certain stereotype 
may tend to make remarks or to behave in ways that make these 
stereotypes more salient in class, thus inducing stereotype threat 
(Aronson & Steele, 2005). Apart from teachers’ classroom behav- 
ior, increasing awareness of widely held stereotypes (McKown & 
Weinstein, 2003) and developing knowledge of adults’ stereotypes 
(Muzzatti & Agnoli, 2007) may shape the students’ own gender 
beliefs. As a further consequence, they may react by adapting their 
own self-concept to these gender beliefs (cf. Kurtz-Costes et al., 
2008). 

Another question deals with the problem of the accuracy of 
teachers’ gender beliefs: the so-called “kernel of truth” of their 
gender stereotype. It could be argued that, taking into account 
recent results from large-scale assessments (cf. Mullis, Martin, 
Foy, & Drucker, 2012; Organization for Economic Co-Operation 
and Development, 2010), teachers’ beliefs of girls outperforming 
boys in reading are to some extent true. This problem is 
connected to the argument that teacher expectations have an im- 
pact on students’ achievement simply because their expectations 
are accurate (Jussim & Harber, 2005). Applied to our results, 
this would mean that there is a negative effect of teachers’ gender 
stereotype on boys’ reading self-concept just because boys do in 
fact have lower reading abilities. These lower abilities should then 
lead to boys’ decreasing reading self-concept. Similarly, since 
boys are likely to show declining motivation for language-related 
tasks in Grade 5 and Grade 6 (cf. Jacobs et al., 2002), our 
results may reflect teachers’ accurate appraisal of boys’ declining 
language-related motivation. In this study, however, we controlled 
for individual achievement as well as for mean class achievement, 
so that teachers’ gender stereotypes have been shown to exert an 
additional effect on students’ reading self-concept that goes be- 
yond the effect of actual achievement. Moreover, the teacher 
questions on gender stereotypes were generally worded, not related 
to the particular classes they were teaching. Considering that the 
teachers completed the questionnaire only a few weeks after they 
first met their students, it seems plausible to assume that the 
teachers’ beliefs about gender differences were not affected by the 
individual students’ motivational declines. 

An important question that arises from our findings is what 
teachers can do to counteract the reported relation between their 
own stereotypes and boys’ reading self-concept. Generally, it is a 
good idea to counteract prior gender stereotypes and make the 
expectation clear in class that boys and girls perform equally well 
(Hartley & Sutton, 2013). Moreover, during their teacher educa- 
tion, teachers should be apprised of the fact that their beliefs do 
have consequences and that, consciously or not, they may be prone 
to certain biases in their treatment of boys and girls. Although 
cultural stereotypes are widely shared and guide behavioral reac- 
tions, people can choose to overcome this automatic effect (Fiske, 
2004). Most research investigating similar discriminatory behavior 
in class has dealt with girls in mathematics and science, and thus 
the question here is whether language teachers behave similarly. 
We cannot answer this question yet, due to the lack of research in 
language teaching, but it would appear that certain rules for teach- 
ers’ classroom behavior, as summarized by Woolfolk (2010), 
should be introduced in the near future. 

However, it should be noted that there is strong evidence that in 
general, teachers interact more frequently with boys than with girls 
(Jones & Dindia, 2004). This difference has mainly been found in 
relation to negative interactions such as criticism, while no differ- 
ence in positive interactions such as praise or acceptance has been 
found. Unfortunately, Jones and Dindia did not test the effects of 
domain or school subject, so their meta-analysis does not provide 
information on differences in mathematics and language teaching. 
In a study by Worrall and Tsarna (1987), the self-reported classroom
interactions of science and language teachers were compared.
These authors found that girls are relatively favored in
language subjects, compared with boys, whereas no differences in
science have been reported. Regarding the question as to what 
teachers can do, Woolfolk (2010) suggested a kind of checklist on 
how to avoid discriminatory behavior in the classroom. First of all, 
she encouraged teachers to be aware of bias in their own behavior. 
Do the teachers group boys and girls for certain tasks? Do they 
prefer boys or girls when asking questions regarding particular 
topics—for example, boys for technical and girls for social issues? 
Second, she asked teachers to check their teaching material for 
gender inequalities, such as presenting traditional role models. 
Third, teachers should have a critical look at general inequalities at 
the school—for example, if there is biased advice regarding course 
selection. Fourth, teachers should use gender-neutral language 
whenever possible. Fifth, teachers should introduce role models 
that do not represent traditional gender roles. 


Conclusion 


Our study complements previous research by investigating the 
effects of teachers’ stereotypes on students’ reading self-concept, 
drawing on a relatively large sample tested in a naturalistic setting. 
Our results suggest that not only do gender stereotypes have 
short-term effects like those investigated in the framework of 
stereotype threat theory (cf. Aronson & Steele, 2005), but they can 
also explain the long-term development of reading self-concept as 
a relatively stable personal characteristic. In our study, boys were 
the disadvantaged group. Therefore, we would like to follow 
Hartley and Sutton (2013) in noting that these results have to be 
considered in the light of general male advantage in society, such 
as the gender pay gap that still persists (e.g., Council of the 
European Union, 2010; Drago & Williams, 2010). However, it 
should not be the aim to pit males’ advantages in one area against 
their disadvantages in another area. We should encourage and 
enable our teachers to counteract prior gender stereotypes and to 
become aware of their own potentially discriminatory behaviors. 
One important condition for an equitable educational system is that 
teachers should become aware of and resistant to stereotypes. 


References 


Anderman, E. M., Eccles, J. S., Yoon, K. S., Roeser, R. W., Wigfield, A., 
& Blumenfeld, P. C. (2001). Learning to value mathematics and reading: 
Relations to mastery and performance-oriented instructional practices. 
Contemporary Educational Psychology, 26, 76-95. doi:10.1006/ceps 
.1999.1043 

Archambault, I., Eccles, J. S., & Vida, M. N. (2010). Ability self-concepts 
and subjective value in literacy. Joint trajectories from Grades 1 through 
12. Journal of Educational Psychology, 102, 804-816. doi:10.1037/ 
a0021075 

Aronson, J. M., & Steele, C. M. (2005). Stereotypes and the fragility of 
academic competence, motivation, and self-concept. In A. J. Elliot & 
C. S. Dweck (Eds.), Handbook of competence and motivation (pp. 
436-456). New York, NY: Guilford Press. 

Asbrock, F., Nieuwoudt, C., Duckitt, J., & Sibley, C. G. (2011). Societal 
stereotypes and the legitimation of intergroup behavior in Germany and 
New Zealand. Analysis of Social Issues and Public Policy, 11, 154-179. 
doi:10.1111/j.1530-2415.2011.01242.x 

Asparouhov, T., & Muthén, B. O. (2012). Multiple group multilevel 
analysis (Mplus Web Notes, No. 16). Retrieved from www.statmodel 
.com/examples/webnotes/webnote16.pdf 

Becker, J. C., & Asbrock, F. (2012). What triggers helping versus harming 
of ambivalent groups? Effects of the relative salience of warmth versus
competence. Journal of Experimental Social Psychology, 48, 19-27. 
doi:10.1016/j.jesp.2011.06.015 

Bos, W., Lankes, E.-M., Prenzel, M., Schwippert, K., Valtin, R., Voss, A., 
& Walther, G. (2005). IGLU: Skalenhandbuch zur Dokumentation der 
Erhebungsinstrumente. [IGLU: Scale manual for documentation of the 
survey instruments in the PIRLS (Progress in International Reading 
Literacy Study)]. Münster, Germany: Waxmann.

Cheryan, S., & Bodenhausen, G. V. (2000). When positive stereotypes 
threaten intellectual performance: The psychological hazards of “model 
minority” status. Psychological Science, 11, 399-402. doi:10.1111/ 
1467-9280.00277 

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. 
Hillsdale, NJ: Erlbaum. 

Council of the European Union. (2010). The gender pay gap in the member 
states of the European Union: Quantitative and qualitative indicators.
Summary of the Belgian presidency report 2010. Brussels, Belgium: 
Author. Retrieved from http://register.consilium.europa.eu/pdf/en/10/ 
st16/st16881-ad01.en10.pdf 

Cuddy, A. J. C., Fiske, S. T., & Glick, P. (2007). The BIAS map: Behaviors 
from Intergroup Affect and Stereotypes. Journal of Personality and
Social Psychology, 92, 631-648. doi:10.1037/0022-3514.92.4.631 

Cuddy, A. J. C., Glick, P., & Beninger, A. (2011). The dynamics of warmth 
and competence judgments, and their outcomes in organizations. Research
in Organizational Behavior, 31, 73-98. doi:10.1016/j.riob.2011.10.004

de Boer, H., Bosker, R. J., & van der Werf, M. P. C. (2010). Sustainability 
of teacher expectation bias effects on long-term student performance. 
Journal of Educational Psychology, 102, 168-179. doi:10.1037/ 
a0017289 

DeZolt, D. M., & Hull, S. H. (2001). Classroom and school climate. In J. 
Worell (Ed.), Encyclopedia of women and gender. Sex similarities and 
differences and the impact of society on gender (pp. 257-264). San 
Diego, CA: Academic Press. 

Drago, R., & Williams, C. (2010). The gender wage gap 2009. Washington,
DC: Institute for Women’s Policy Research. Retrieved from
www.iwpr.org/publications/pubs/the-gender-wage-gap-2009/at_download/file

Durik, A. M., Vida, M., & Eccles, J. S. (2006). Task values and ability 
beliefs as predictors of high school literacy choices: A developmental 
analysis. Journal of Educational Psychology, 98, 382-393. doi:10.1037/ 
0022-0663.98.2.382 

Eccles, J. S., Adler, T. F., Futterman, R., Goff, S. B., Kaczala, C. M., 
Meece, J. L., & Midgley, C. (1983). Expectations, values, and academic 
behaviors. In J. T. Spence (Ed.), Achievement and achievement motives: 
Psychological and sociological approaches (pp. 75-146). San Fran- 
cisco, CA: Freeman. 

Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and 
goals. Annual Review of Psychology, 53, 109-132.
doi:10.1146/annurev.psych.53.100901.135153

Eccles, J. S., Wigfield, A., Harold, R. D., & Blumenfeld, P. C. (1993). Age 
and gender differences in children’s self- and task perceptions during
elementary school. Child Development, 64, 830-847. doi:10.2307/ 
31221 

Evans, A. B., Copping, K. E., Rowley, S. J., & Kurtz-Costes, B. (2011). 
Academic self-concept in Black adolescents: Do race and gender ste- 
reotypes matter? Self and Identity, 10, 263-277. doi:10.1080/15298868 
.2010.485358 

Fiedler, K., & Bless, H. (2001). Social cognition. In M. Hewstone & W. 
Stroebe (Eds.), Introduction to social psychology: A European perspec- 
tive (pp. 115-149). Malden, MA: Blackwell. 

Fiske, S. T. (2004). What’s in a category? Responsibility, intent, and the
avoidability of bias against outgroups. In A. G. Miller (Ed.), The social 
psychology of good and evil (pp. 127-140). New York, NY: Guilford 
Press. 


Graham, J. W. (2009). Missing data analysis: Making it work in the real 
world. Annual Review of Psychology, 60, 549-576. doi:10.1146/annurev 
.psych.58.110405.085530 

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many 
imputations are really needed? Some practical clarifications of multiple 
imputation theory. Prevention Science, 8, 206-213. doi:10.1007/ 
s11121-007-0070-9 

Hartley, B. L., & Sutton, R. M. (2013). A stereotype threat account of 
boys’ academic underachievement. Child Development, 84, 1716-1733. 
doi:10.1111/cdev.12079 

Hyde, J. S., & Durik, A. M. (2005). Gender, competence, and motivation. 
In A. J. Elliot & C. S. Dweck (Eds.), Handbook of competence and 
motivation (pp. 375-391). New York, NY: Guilford Press. 

Ireson, J., & Hallam, S. (2009). Academic self-concepts in adolescence:
Relations with achievement and ability grouping in schools. Learning 
and Instruction, 19, 201-213. doi:10.1016/j.learninstruc.2008.04.001 

Jacobs, J. E., & Eccles, J. S. (1992). The impact of mothers’ gender-role
stereotypic beliefs on mothers’ and children’s ability perceptions. Jour- 
nal of Personality and Social Psychology, 63, 932-944. doi:10.1037/ 
0022-3514.63.6.932 

Jacobs, J. E., Lanza, S., Osgood, D. W., Eccles, J. S., & Wigfield, A. 
(2002). Changes in children’s self-competence and values: Gender and 
domain differences across Grades One through Twelve. Child Develop- 
ment, 73, 509-527. doi:10.1111/1467-8624.00421 

Jones, S. M., & Dindia, K. (2004). A meta-analytic perspective on sex 
equity in the classroom. Review of Educational Research, 74, 443-471. 
doi:10.3102/00346543074004443 

Jussim, L., Eccles, J. S., & Madon, S. (1996). Social perception, social 
stereotypes, and teacher expectations: Accuracy and the quest for the 
powerful self-fulfilling prophecy. In M. P. Zanna (Ed.), Advances in 
experimental social psychology (pp. 281-388). San Diego, CA: Aca- 
demic Press. 

Jussim, L., & Harber, K. D. (2005). Teacher expectations and self-fulfilling 
prophecies: Knowns and unknowns, resolved and unresolved controver- 
sies. Personality and Social Psychology Review, 9, 131-155. doi: 
10.1207/s15327957pspr0902_3 

Kurtz-Costes, B., Rowley, S. J., Harris-Britt, A., & Woods, T. A. (2008). 
Gender stereotypes about mathematics and science and self-perceptions 
of ability in late childhood and early adolescence. Merrill-Palmer Quar- 
terly, 54, 386-409. doi:10.1353/mpq.0.0001 

Marsh, H. W., Seaton, M., Trautwein, U., Lüdtke, O., Hau, K.-T., O’Mara,
A. J., & Craven, R. G. (2008). The big-fish–little-pond-effect stands up
to critical scrutiny: Implications for theory, methodology, and future 
research. Educational Psychology Review, 20, 319-350. doi:10.1007/ 
s10648-008-9075-6 

Marsh, H. W., Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2006).
Integration of multidimensional self-concept and core personality con- 
structs: Construct validation and relations to well-being and achieve- 
ment. Journal of Personality, 74, 403-456. doi:10.1111/j.1467-6494 
.2005.00380.x 

Martinot, D., Bagès, C., & Désert, M. (2012). French children’s awareness
of gender stereotypes about mathematics and reading: When girls im- 
prove their reputation in math. Sex Roles, 66, 210-219. doi:10.1007/ 
s11199-011-0032-3

McKown, C., & Weinstein, R. S. (2003). The development and conse- 
quences of stereotype consciousness in middle childhood. Child Devel- 
opment, 74, 498-515. doi:10.1111/1467-8624.7402012 

Meece, J. L., Bower Glienke, B., & Burg, S. (2006). Gender and motiva- 
tion. Journal of School Psychology, 44, 351-373. doi:10.1016/j.jsp.2006 
.04.004 

Möller, J., & Bonerad, E.-M. (2007). Fragebogen zur habituellen
Lesemotivation. [Habitual Reading Motivation Questionnaire]. Psychologie in
Erziehung und Unterricht, 54, 259-267. 


Möller, J., & Marsh, H. W. (2013). Dimensional comparison theory.
Psychological Review, 120, 544-560. doi:10.1037/a0032459 

Möller, J., Retelsdorf, J., Köller, O., & Marsh, H. W. (2011). The
reciprocal internal/external frame of reference model: An integration of
models of relations between academic achievement and self-concept. 
American Educational Research Journal, 48, 1315-1346. doi:10.3102/ 
0002831211419649 

Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). PIRLS 
2011 international results in reading. Chestnut Hill, MA: Boston Col- 
lege. 

Muthén, L. K., & Muthén, B. O. (2013). Mplus Version 7.1 [Computer 
software]. Los Angeles, CA: Muthén & Muthén. 

Muzzatti, B., & Agnoli, F. (2007). Gender and mathematics: Attitudes and 
stereotype threat susceptibility in Italian children. Developmental Psy- 
chology, 43, 747-759. doi:10.1037/0012-1649.43.3.747 

Neugebauer, M., Helbig, M., & Landmann, A. (2011). Unmasking the 
myth of the same-sex teacher advantage. European Sociological Review, 
27, 669-689. doi:10.1093/esr/jcq038 

Nguyen, H.-H. D., & Ryan, A. M. (2008). Does stereotype threat affect test 
performance of minorities and women? A meta-analysis of experimental 
evidence. Journal of Applied Psychology, 93, 1314-1334. doi:10.1037/ 
a0012702 

Organization for Economic Co-Operation and Development. (2010). PISA 
2009 results: What students know and can do. Student performance in 
reading, mathematics and science (Vol. I). Paris, France: Author.
http://dx.doi.org/10.1787/9789264091450-en

Plante, I., de la Sablonnière, R., Aronson, J. M., & Théorêt, M. (2013).
Gender stereotype endorsement and achievement-related outcomes: The 
role of competence beliefs and task values. Contemporary Educational 
Psychology, 38, 225-235. doi:10.1016/j.cedpsych.2013.03.004 

Retelsdorf, J., Becker, M., Köller, O., & Möller, J. (2012). Reading
development in a tracked school system: A longitudinal study over 3 
years using propensity score matching. British Journal of Educational 
Psychology, 82, 647-671. doi:10.1111/j.2044-8279.2011.02051.x

Retelsdorf, J., Köller, O., & Möller, J. (2011). On the effects of motivation
on reading performance growth in secondary school. Learning and 
Instruction, 21, 550-559. doi:10.1016/j.learninstruc.2010.11.001 

Retelsdorf, J., Köller, O., & Möller, J. (2014). Reading achievement and
reading self-concept: Testing the reciprocal effects model. Learning and 
Instruction, 29, 21-30. doi:10.1016/j.learninstruc.2013.07.004 

Rouland, K. K., Rowley, S. J., & Kurtz-Costes, B. (2013). Self-views of 
African-American youth are related to the gender stereotypes and ability 
attributions of their parents. Self and Identity, 12, 382-399. doi:10.1080/ 
15298868.2012.682360 

Rowley, S. J., Kurtz-Costes, B., Mistry, R., & Feagans, L. (2007). Social 
status as a predictor of race and gender stereotypes in late childhood and 
early adolescence. Social Development, 16, 150-168.
doi:10.1111/j.1467-9507.2007.00376.x

Rubie-Davies, C., Hattie, J. A. C., & Hamilton, R. (2006). Expecting the 
best for students: Teacher expectations and academic outcomes. British 
Journal of Educational Psychology, 76, 429-444. doi:10.1348/ 
000709905X53589 

Schmenk, B. (2004). Language learning: A feminine domain? The role of 
stereotyping in constructing gendered learner identities. TESOL Quar- 
terly, 38, 514-524. doi:10.2307/3588352 


Schneider, D. J. (2004). The psychology of stereotyping. New York, NY: 
Guilford Press. 

Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Self-concept: 
Validation of construct interpretations. Review of Educational Research, 
46, 407-441. doi:10.3102/00346543046003407 

Skaalvik, S., & Skaalvik, E. M. (2004). Gender differences in math and 
verbal self-concept, performance expectations, and motivation. Sex 
Roles, 50, 241-252. doi:10.1023/B:SERS.0000015555.40976.e6 

Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual 
identity and performance. American Psychologist, 52, 613-629. doi: 
10.1037/0003-066X.52.6.613 

Swinson, J., & Harrop, A. (2009). Teacher talk directed to boys and girls 
and its relationship to their behaviour. Educational Studies, 35, 515-524. 
doi:10.1080/03055690902883913

Tajfel, H. (1981). Social stereotypes and social groups. In J. C. Turner & 
H. Giles (Eds.), Intergroup behavior (pp. 144-167). Oxford, England: 
Blackwell. 

Tajfel, H., & Turner, J. C. (1986). The social identity theory of intergroup 
behavior. In S. Worchel & W. G. Austin (Eds.), Psychology of inter- 
group relations (pp. 7-24). Chicago, IL: Nelson-Hall. 

Tiedemann, J. (2000). Parents’ gender stereotypes and teachers’ beliefs as
predictors of children’s concept of their mathematical ability in
elementary school. Journal of Educational Psychology, 92, 144-151. doi:
10.1037/0022-0663.92.1.144 

Tiedemann, J. (2002). Teachers’ gender stereotypes as determinants of
teacher perceptions in elementary school mathematics. Educational 
Studies in Mathematics, 50, 49-62. doi:10.1023/A:1020518104346 

Tymms, P. (2004). Effect sizes in multilevel models. In I. Schagen & K. 
Elliot (Eds.), But what does it mean? The use of effect sizes in educa- 
tional research (pp. 55-66). London, England: National Foundation for 
Educational Research. 

Watt, H. M. G., & Eccles, J. S. (Eds.). (2008). Gender and occupational 
outcomes: Longitudinal assessments of individual, social, and cultural 
influences. Washington, DC: American Psychological Association. doi:
10.1037/11706-000 

Wigfield, A., & Eccles, J. S. (2000). Expectancy-value theory of achieve- 
ment motivation. Contemporary Educational Psychology, 25, 68-81. 
doi:10.1006/ceps.1999.1015 

Wigfield, A., Eccles, J. S., Yoon, K. S., Harold, R. D., Arbreton, A. J. A., 
Freedman-Doan, C., & Blumenfeld, P. C. (1997). Change in children’s
competence beliefs and subjective task values across the elementary 
school years: A 3-year study. Journal of Educational Psychology, 89, 
451-469. doi:10.1037/0022-0663.89.3.451 

Woolfolk, A. E. (2010). Educational psychology. Columbus, OH: Pearson/ 
Allyn & Bacon. 

Worrall, N., & Tsarna, H. (1987). Teachers’ reported practices towards
girls and boys in science and languages. British Journal of Educational 
Psychology, 57, 300-312. doi:10.1111/j.2044-8279.1987.tb00859.x 

Wu, M. L., Adams, R. J., & Wilson, M. (1998). ConQuest: Generalized 
item response modeling software [Computer software]. Melbourne, Aus- 
tralia: Australian Council for Educational Research. 


Received October 27, 2013 
Revision received April 29, 2014 
Accepted May 3, 2014


Journal of Educational Psychology 
2015, Vol. 107, No. 1, 195-206 


© 2014 American Psychological Association 
0022-0663/15/$12.00 http://dx.doi.org/10.1037/a0036981


Gender Differences in the Effects of a Utility-Value Intervention to Help 
Parents Motivate Adolescents in Mathematics and Science 


Christopher S. Rozek, Janet S. Hyde, and Ryan C. Svoboda
University of Wisconsin–Madison

Chris S. Hulleman
University of Virginia

Judith M. Harackiewicz
University of Wisconsin–Madison


A foundation in science, technology, engineering, and mathematics (STEM) education is critical for 
students’ college and career advancement, but many U.S. students fail to take advanced mathematics and 
science classes in high school. Research has neglected the potential role of parents in enhancing students’ 
motivation for pursuing STEM courses. Previous research has shown that parents’ values and expec- 
tancies may be associated with student motivation, but little research has assessed the influence of parents 
on adolescents through randomized experiments. Harackiewicz, Rozek, Hulleman, and Hyde (2012) 
documented an increase in adolescents’ STEM course-taking for students whose parents were assigned 
to a utility-value intervention in comparison to a control group. In this study, we examined whether that 
intervention was equally effective for boys and girls and examined factors that moderate and mediate the 
effect of the intervention on adolescent outcomes. The intervention was most effective in increasing 
STEM course-taking for high-achieving daughters and low-achieving sons, whereas the intervention did 
not help low-achieving daughters (prior achievement measured in terms of grade point average in 
9th-grade STEM courses). Mediation analyses showed that changes in STEM utility value for mothers 
and adolescents mediated the effect of the intervention on 12th-grade STEM course-taking. These results 
are consistent with a model in which parents’ utility value plays a causal role in affecting adolescents’ 
achievement behavior in the STEM domain. The findings also indicate that utility-value interventions 
with parents can be effective for low-achieving boys and for high-achieving girls but suggest
modifications in their use with low-achieving girls.


Keywords: academic motivation, educational intervention, STEM motivation, gender differences 


In the United States, national education policies have focused on 
improving the performance of U.S. students relative to their inter- 
national peers, particularly in areas related to science, technology, 
engineering, and mathematics (STEM; National Science Founda- 
tion [NSF], 2012). Of particular concern are students’ decisions 
not to take advanced science and mathematics courses in high 
school. For example, only 35% of high school graduates have 
taken precalculus and only 39% have taken physics (NSF, 2012). 
Moreover, although gender gaps have closed for course-taking in 
some STEM areas, they persist in others. For example, although
girls and boys take calculus at the same rate, boys are more likely
to take physics than girls are (42% vs. 36%) and are more likely to 
take engineering in high school (6% vs. 1%; NSF, 2012). Recently, 
a number of interventions have been implemented to increase 
STEM motivation and to close gender gaps (e.g., Harackiewicz et 
al., 2014; Hulleman & Harackiewicz, 2009; Miyake et al., 2010; 
Walton & Cohen, 2011). Here, we report on the moderators and 
mediators of an intervention shown to help parents motivate their 
adolescents to take mathematics and science courses in high school 
(Harackiewicz, Rozek, Hulleman, & Hyde, 2012). We probed
whether the intervention was equally effective for boys and girls
depending on their prior performance in mathematics and science 
courses and what factors mediated the effect of the intervention on 
students’ STEM course-taking. 


Theoretical Framework 


Numerous theoretical models have been proposed to help ex- 
plain student motivation and persistence in academics. One com- 
prehensive model is Eccles’s expectancy-value theory (Eccles- 
Parsons et al., 1983), which frames the research reported here. The 
expectancy-value model holds that expectations for success (ex- 
pectancy) and perceived task value are direct predictors of 
achievement and achievement choices (e.g., Eccles-Parsons et al., 
1983; Simpkins, Davis-Kean, & Eccles, 2006; Updegraff, Eccles, 
Barber, & O’Brien, 1996). In Eccles’s model, expectancy for 
success is defined as how well an individual thinks he or she will 
do on an ensuing task (Eccles-Parsons et al., 1983). Task value 
consists of attainment value (how a task is related to one’s iden- 
tity), intrinsic value (enjoyment of the task), utility value (per- 
ceived usefulness of a task), and cost (costs to the individual of 
task engagement, such as what one concedes by choosing one task 
over others). 

The expectancy-value model proposes that adolescents’ per- 
ceived task values and expectations for success are the most 
proximal predictors of STEM-related achievement choices. Previ- 
ous research supports this hypothesis, with students being more 
likely to choose to take mathematics and science courses when 
they have either high expectations for success or value for those 
courses or both (e.g., Eccles, Barber, Updegraff, & O’Brien, 1998; 
Simpkins et al., 2006; Updegraff et al., 1996; Watt, 2005; Watt, 
Eccles, & Durik, 2006). In addition, both expectancies and values 
predict classroom performance (e.g., Hulleman, Durik, 
Schweigert, & Harackiewicz, 2008; Watt, 2005). 


Parents’ Influence on Values and Expectancies 


The expectancy-value model proposes that, more distally, key 
socializers, such as parents, play an important role in shaping 
adolescents’ values. Previous research has found that parents’ 
values and expectancies for success for their child are linked to 
adolescents’ values in a variety of domains, including mathematics 
and science (Jodl, Michael, Malanchuk, Eccles, & Sameroff, 2001; 
Simpkins, Fredricks, & Eccles, 2012). Much of this research has
concentrated on adolescents and their achievement motivation in 
STEM courses throughout middle school and high school (Riegle- 
Crumb & King, 2010; Watt et al., 2012). Parents’ values for 
mathematics and science are associated with adolescents’ values in 
mathematics and science, which, subsequently, are associated with 
adolescents’ educational choices and outcomes (Jodl et al., 2001; 
Simpkins et al., 2012). 

Parents’ expectancies for their adolescents have also been asso- 
ciated with their adolescents’ expectancies for success in mathe- 
matics and science and educational outcomes, and these associa- 
tions are even stronger than the associations between parents’ 
values and adolescents’ outcomes (Bleeker & Jacobs, 2004; Frome 
& Eccles, 1998; Jacobs & Eccles, 1992; Yee & Eccles, 1988). For 
instance, if parents have high expectancies for their adolescents in 
STEM, they are more likely to have adolescents with high
expectancies and better educational outcomes in STEM courses. If
parents have low expectancies, they are more likely to have ado- 
lescents with low expectancies and worse educational outcomes in 
STEM (Jacobs & Eccles, 1992). However, studies involving the 
associations between parents’ and adolescents’ expectations and 
values are typically correlational in nature and thus are unable to 
test for a causal effect of parents’ values and expectations on 
adolescents’ values and expectations. 

Whereas multiple studies have focused on the role of parental 
support—such as involvement and support for autonomy—in re- 
lation to children’s school outcomes (Grolnick & Ryan, 1989; 
Grolnick, Ryan, & Deci, 1991; Ratelle, Larose, Guay, & Senécal, 
2005; Spera, 2005), here we focus on parents’ values for their 
child’s education. Such values may be a key resource that educa- 
tors can leverage to enhance student outcomes, such as STEM 
course-taking (Harackiewicz et al., 2012). From a process perspec- 
tive, it is important to understand how parents’ values are trans- 
mitted to children. Some researchers have examined the specific 
parental behaviors that contribute to value transmission from par- 
ents to adolescents, such as encouragement, provision of educa- 
tional and other materials, and coactivity (e.g., Simpkins et al., 
2012). However, parental behaviors are not the only means of 
value transmission. Because students’ perceptions are featured 
heavily in the expectancy-value model, it is important to examine 
whether adolescents are even aware of their parents’ values. If 
adolescents are unaware of their parents’ utility-value beliefs, 
parents’ values may have smaller effects on their adolescents’ 
attitudes and behaviors. Such perceptions could serve as an im- 
portant indicator that parental values are being communicated 
(Paulson & Sputa, 1996; Spera, 2006; Wood, Kurtz-Costes, & 
Copping, 2011). 


Gender Differences in Expectancies and Values 


Two aspects of Eccles’s model have been hypothesized to show 
gender differences that, in turn, may explain differences in STEM 
achievement: gender differences in expectancies and gender dif- 
ferences in values (e.g., Eccles, Wigfield, Harold, & Blumenfeld, 
1993; Updegraff et al., 1996). Compared with boys, girls have 
lower expectancies for success in STEM domains (Yee & Eccles, 
1988). This difference predicts increased enrollment in these 
courses for boys (Watt et al., 2012). Gender differences in expec- 
tancies for success can be influenced by socializers, especially 
parents. Research indicates that parents can have exaggerated 
expectancies for success in mathematics and science for their sons 
and diminished expectancies for success for their daughters 
(Eccles et al., 1993; Yee & Eccles, 1988). 

The amount of value that boys and girls place on mathematics 
and science as well as the number of valued domains may influ- 
ence gender differences in STEM achievement choices as well. 
The results are mixed on whether boys and girls differ in how 
much they value STEM domains, with many studies showing no 
gender differences in levels of STEM value (Eccles, 2009). How- 
ever, there are gender differences in the number of valued do- 
mains, suggesting that women place high value on more domains 
(including non-STEM domains) than men do, which can lead to 
even high levels of STEM value being relatively less important for 
women (Eccles, 2007; Eccles, Barber, & Jozefowicz, 1999; 
Thoman, Arizaga, Smith, Story, & Soncuya, 2013). Additionally,
women, compared with men, tend to believe it is more important
to make occupational sacrifices for the family and to have a job
that helps people, which is one of the strongest predictors for
women not pursuing STEM careers (Eccles, 2007). Men, however,
are more likely to value making money and having a successful 
career. This difference may be especially crucial for talented girls, 
because they are caught between their beliefs in gender stereotypes 
on the one hand and their accomplishments in mathematics and 
science courses on the other (Eccles, 2007). Thus, high-achieving 
girls may shy away from enrolling in challenging STEM courses 
because of their belief in cultural stereotypes. Parents and other 
socializers, whose values are influenced by cultural stereotypes, 
may transmit these stereotyped beliefs to their adolescents. 


Utility-Value Interventions


Recent studies have focused on understanding the particular role 
of utility value (UV) in achievement behaviors (Durik & Harack- 
iewicz, 2007; Hulleman et al., 2008; Hulleman, Godes, Hendricks, 
& Harackiewicz, 2010; Hulleman & Harackiewicz, 2009; Kauff- 
man & Husman, 2004; Shechter, Durik, Miyamoto, & Harackie- 
wicz, 2011). For example, Hulleman et al. (2008) found that 
students’ perceptions of utility value predicted achievement in 
both a college classroom and a high school sports camp. In another 
study, students who had higher utility value for their studies 
persisted longer and performed better than those who had lower 
levels (Vansteenkiste, Simons, Lens, Sheldon, & Deci, 2004). 

On the basis of this correlational research, researchers have recently 
begun to manipulate utility value with interventions in the lab, class- 
room, and home (Acee & Weinstein, 2010; Durik & Harackiewicz, 
2007; Harackiewicz et al., 2012; Hulleman et al., 2010; Hulleman & 
Harackiewicz, 2009). They have targeted utility value in particular 
because it is likely that perceptions of utility value can be changed 
with interventions. Attainment and intrinsic values are more intrinsic 
and therefore would be difficult for an outside entity to manipulate. 
Utility value, in contrast, should be amenable to change by an 
intervention. Studies have found that these utility value interven- 
tions cause an increase in interest and performance in the subject, 
including STEM topics (Durik & Harackiewicz, 2007; Hulleman 
et al., 2010; Hulleman & Harackiewicz, 2009; Shechter et al., 
2011). Although these UV interventions have had positive effects 
on motivation, these effects have typically been moderated by past 
performance or expectations for success, which is consistent with 
expectancy-value theory (Nagengast et al., 2011; Trautwein et al., 
2012). Individuals with high expectations for success responded 
most positively when told why a topic was relevant to their lives 
(e.g., Durik & Harackiewicz, 2007), whereas individuals with low 
expectations for success showed no positive response or responded 
negatively when given relevance (UV) information (for a review, 
see Durik, Hulleman, & Harackiewicz, 2013). These results suggest 
that it is critically important to consider the role of expectations 
and past performance in studies involving utility-value 
interventions. 


Indirect Utility-Value Interventions 


Based on the documented potential of UV information to promote 
motivation for many individuals and the associations between 
parents’ values and their adolescent’s values in correlational 
research, we implemented a utility-value intervention aimed at 
parents (Harackiewicz et al., 2012). The ultimate goal of this 
intervention was to increase adolescents’ STEM UV and STEM 
course-taking in high school. Previous research had not used 
randomized experiments to test the influence of parents on ado- 
lescents’ utility value and achievement choices, but this study was 
able to evaluate the role of parents by randomly assigning them to 
an experimental UV intervention versus control condition. In the 
experimental condition, parents in an ongoing longitudinal study 
were given information about the relevance or usefulness (utility 
value) of mathematics and science for their adolescent. Parents in 
the control group received no information. 

The results indicated that adolescents whose parents were in the 
intervention group took almost a semester more of mathematics 
and science classes during the last 2 years of high school than 
those whose parents were in the control group. These results 
indicated that parents can play a crucial role in increasing impor- 
tant adolescent achievement choices, such as advanced STEM 
course-taking. Although this intervention was effective for adoles- 
cents on average, it is important to consider the possibility that this 
intervention effect may vary as a function of gender and past 
performance, as has been observed in previous studies. It is also 
important to examine how this intervention worked to influence 
adolescents’ course-taking. 


The Current Study 


This study goes beyond our previous evaluation of the utility- 
value intervention described above, to investigate for whom the 
intervention worked best and how it worked. The first research 
question asked whether gender and past performance (i.e., 9th- 
grade math and science grade point average) moderated the effects 
of the intervention. Previously, we found a main effect of the 
intervention on course-taking in the last 2 years of high school; 
later we coded past performance from high-school transcripts to 
use as a proxy for expectancies to test for an expectancy (prior 
performance) by value (intervention) interaction. Given the under- 
representation of women in many STEM fields (Halpern et al., 
2007) and previously documented gender differences in expectan- 
cies and values in the STEM domain, we tested both gender and 
past performance as moderators of the intervention effect. Al- 
though in an earlier paper we reported that the intervention effect 
did not differ as a function of gender (Harackiewicz et al., 2012), 
we hypothesized that gender differences might emerge once we 
considered students’ past performance. We therefore tested for an 
interaction among the intervention, gender, and past performance 
in STEM classes. 

Using a mediation model, the second research question asked 
what mechanisms accounted for the effect of the intervention on 
students’ course-taking (see Figure 1 for the theoretical model). 
We hypothesized that the intervention would lead to increased 
STEM UV for parents, which we assessed with questionnaires 
given to mothers of the adolescents. This increase in mothers’ 
STEM UV was then predicted to be associated with an increase in 
adolescents’ perceptions of parents’ STEM values and adoles- 
cents’ STEM UV. To provide the strongest test of mediation, we 
capitalized on the longitudinal design of the original study. The 
outcome variable was 12th-grade STEM course-taking. Mothers’ 
perceived STEM UV, adolescents’ perceptions of parents’ STEM 
values, and adolescents’ perceived STEM utility value were 
measured in the summer after 11th grade and therefore could be tested 
as mediators in the analyses of the effects of the intervention 
(which occurred during 10th and 11th grades) on 12th-grade 
STEM course-taking. These variables were predicted to mediate 
the effect of the intervention on 12th-grade STEM course-taking. 

Figure 1. Theoretical model. STEM = science, technology, engineering, 
and mathematics; UV = utility value. 


Method 


Participants 


The sample comprised families participating in the longitudinal 
Wisconsin Study of Families and Work (WSFW; for details on 
recruitment, see Hyde, Klein, Essex, & Clark, 1995). The current 
sample consisted of 188 adolescents (88 girls, 100 boys) and their 
parents who participated in a randomized experiment during high 
school (Harackiewicz et al., 2012). With regard to ethnicity, 90% 
of the adolescents were White (not of Hispanic origin), 2% were 
African American, 1% were Native American, and 7% were bira- 
cial or multiracial; this distribution is characteristic of the state of 
recruitment, in which 90% of the population is White (U.S. Census 
Bureau, 2006). At the time of data collection, participants attended 
108 different high schools, increasing the generalizability of the 
findings. In 2010, the majority of adolescents (98%) had graduated 
from high school, and 94% reported plans to attend college or 
technical school. Average parents’ years of education was 15.42 
years (SD = 1.92) on a scale where 12 years is equivalent to high 
school graduation or GED completion. 


Experimental Procedure 


The intervention was implemented in October 2007 (10th grade) 
and again in January 2009 (11th grade). Families were followed 
through the teens’ graduation from high school in June 2010. 
Families were randomly assigned to one of two experimental 
conditions and were blocked on gender of teen and mothers’ 
educational level. Of these 188 families, 83 were in the experi- 
mental group and 105 were in the control group. 

The intervention materials (two brochures and a website) were 
delivered exclusively to parents and focused on the usefulness of 
mathematics and science for adolescents. In particular, these 
materials explored potential connections between mathematics and 
science and current and future goals of adolescents (Harackiewicz 
et al., 2012). A first brochure, titled “Making Connections: Help- 
ing Your Teen Find Value in School,” was sent to each household, 
addressed to the parents, in October of 10th grade. A second 
brochure, titled “Making Connections: Helping Your Teen with 
the Choices Ahead,” was sent to each parent separately in January 
of 11th grade. This mailing included a letter giving them access to 
a dedicated, password-protected website called “Choices Ahead.” 
Additionally, in the spring of 11th grade, parents in the experi- 
mental group were asked to complete an online questionnaire to 
evaluate the Choices Ahead website, which resulted in more par- 
ents visiting the website. A high percentage of parents (86%) 
reported using these resources, and a high percentage of adoles- 
cents (75%) reported exposure to this information. Parents in the 
control group did not receive any of these materials. 

The 10th-grade brochure provided information about the impor- 
tance or usefulness of mathematics and science in daily life and for 
various careers; it also provided parents with information about 
how to talk with adolescents about these issues. The 11th-grade 
brochure focused on these same themes but with different exam- 
ples, and it gave greater emphasis to everyday activities (e.g., 
video games, cell phones) and preparation for college and careers. 
The 11th-grade brochure provided additional information for par- 
ents about communicating with their children about these issues 
and personalizing the relevance of mathematics and science for 
their 11th grader. The website featured clickable links to resources 
about STEM fields and careers. It also presented interviews with 
current college students who explained the usefulness of the math- 
ematics and science courses that they had taken in high school. 
Parents were able to e-mail specific links from the site to their 
teens. 


Measures 


STEM courses taken in 12th grade and prior performance. 
Transcripts were obtained for 181 of the 188 students in the 
sample and came from 108 different high schools. Receipt of 
transcripts did not vary due to experimental condition or gen- 
der. The remaining sample of 181 families included 47 girls and 
53 boys in the control group and 39 girls and 42 boys in the 
intervention group. For the outcome measure, we coded tran- 
scripts for the number of semesters of mathematics and science 
taken during 12th grade (12th-grade STEM course-taking). 
(Note that Harackiewicz et al. (2012) used number of mathe- 
matics and science courses taken in 11th and 12th grades as the 
outcome variable. Here we used just the number taken in 12th 
grade, so that a mediation model could be tested with mediators 
measured in 11th grade.) 

For the measure of prior STEM performance, we created a 
standardized measure of ninth-grade STEM grade point average 
(GPA) by individually calculating each adolescent’s GPA for 
mathematics and science courses taken in ninth grade on a GPA 
scale that ranged from 0 (F) to 4.0 (A/A+). The scale distin- 
guished between grades by one third of a grade point (e.g., A = 
4.0, A- = 3.67, B+ = 3.33). The final measure was a weighted, 
cumulative STEM GPA from ninth grade that took into account 
the number of credits each course counted to weight the course 
grade. 
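
The GPA measure described above can be sketched in a few lines of Python. This is only an illustration of a credit-weighted average followed by standardization, not the authors' scoring code. The grade values below B+ extend the one-third-point steps given in the text and are an assumption, as are the function names and the use of the sample standard deviation.

```python
from statistics import mean, stdev

# Grade points in one-third steps. A/A+ = 4.0, A- = 3.67, B+ = 3.33 come from
# the text; the remaining entries are assumed to follow the same pattern.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.67,
    "B+": 3.33, "B": 3.0, "B-": 2.67,
    "C+": 2.33, "C": 2.0, "C-": 1.67,
    "D+": 1.33, "D": 1.0, "D-": 0.67,
    "F": 0.0,
}

def weighted_gpa(courses):
    """Credit-weighted GPA for a list of (letter_grade, credits) pairs."""
    total_credits = sum(credits for _, credits in courses)
    total_points = sum(GRADE_POINTS[grade] * credits for grade, credits in courses)
    return total_points / total_credits

def standardize(values):
    """Convert raw GPAs to z scores (mean 0, sample SD 1); an assumed choice."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

For example, `weighted_gpa([("A", 1.0), ("B+", 0.5)])` weights the full-credit A twice as heavily as the half-credit B+.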

Mother’s STEM utility value, adolescent’s perceptions of 
parents’ values, and adolescents’ future STEM value. 
Questionnaires given to mothers and adolescents in the summer 
after 11th grade included one measure from mothers (mothers’ 
STEM UV for their adolescent) and two adolescent measures 
(perceptions of parents’ STEM values and adolescents’ STEM 
value). Response rates on the questionnaires were 83% for mothers 
and 77% for adolescents. All measures were based on items 
developed by Eccles and colleagues (e.g., Eccles & Wigfield, 
2002; Eccles-Parsons et al., 1983). Mothers’ STEM UV was 
measured with four items that asked about the mother’s percep- 
tions of the utility value of mathematics and science for her 
adolescent (e.g., In general, how useful will [biology] be for your 
teen in the future? α = .79). This question was asked about four 
STEM topics: biology, mathematics, chemistry, and physics. 
Responses were on a scale from 1 (not at all useful) to 5 (very useful). 
Fathers also reported on STEM UV for their adolescent. However, 
the response rate for fathers at 11th grade was only 62%, creating 
substantial missing data. Therefore, we used only the variable from 
mothers. 

For adolescents’ perceptions of parents’ values, adolescents 
rated how important their parents thought mathematics and science 
would be in their lives with two items (e.g., My parents think math 
and science are important for my life; α = .78). Adolescents’ 
perception of the value of mathematics and science for their future 
(future STEM value) was measured with four items that focused 
on the current and future value of mathematics and science for 
themselves (e.g., Math and science are important for my future; 
α = .79). Adolescent measures were rated on a scale from 1 
(strongly disagree) to 7 (strongly agree). 

Parents’ education. In the current sample (N = 181), mothers 
averaged 15.42 years of education (SD = 2.10), and fathers also 
averaged 15.42 years of education (SD = 2.41). A variable of 
parents’ average years of education (M = 15.42, SD = 1.92) was 
created by averaging these two variables (r = .44). In this paper, 
we use mothers’ education for analyses involving mother variables 
and parents’ education for analyses not involving mothers’ reports. 


Overview of Analyses 


We used multiple regression followed by structural equation 
modeling to analyze these data in two stages. First, multiple 
regression was used to investigate the direct effects of the predic- 
tors on 12th-grade STEM courses taken, which was the primary 
outcome variable. Second, a structural equation model was esti- 
mated based on the theoretical model (see Figure 1) to examine the 
relationships among the predictors, mediators (mothers’ UV, per- 
ceptions of parents’ values, and adolescents’ future STEM value), 
and the outcome in a single model. In this model, we tested 
whether the total indirect effect of the predictors on the outcome 
through the mediators was significant (Preacher & Hayes, 2008). 
Cases with missing data were included by using full information 
maximum likelihood methods (Arbuckle, 1996). 

There were seven predictors involving the intervention and the 
moderators of the intervention (base predictors): the intervention 
(coded as 1 for intervention group and -1 for control group), 
adolescent’s gender (coded 1 for boys and -1 for girls), 
ninth-grade STEM GPA (measured continuously and standardized), and 
two- and three-way interactions (the interaction of the intervention 
by adolescent’s gender, the interaction between the intervention 
and ninth-grade STEM GPA, the interaction between adolescent’s 
gender and ninth-grade STEM GPA, and the three-way interaction 
among the intervention, adolescent’s gender, and ninth-grade 
STEM GPA). Finally, we included a term to test parental educa- 
tion. 
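
As a minimal sketch of the coding scheme just described, the seven base predictors for one adolescent can be built from the effect-coded intervention and gender variables and the standardized GPA. The function and key names are hypothetical and chosen only for illustration; the source does not show the authors' data-preparation code.

```python
def base_predictors(in_intervention, is_boy, gpa_z):
    """Return the seven base predictors for one adolescent.

    in_intervention: True for the intervention group (coded +1), else -1.
    is_boy: True for boys (coded +1), False for girls (coded -1).
    gpa_z: standardized ninth-grade STEM GPA.
    """
    t = 1 if in_intervention else -1
    g = 1 if is_boy else -1
    return {
        "intervention": t,
        "gender": g,
        "gpa": gpa_z,
        # Interaction terms are products of the coded main effects.
        "intervention_x_gender": t * g,
        "intervention_x_gpa": t * gpa_z,
        "gender_x_gpa": g * gpa_z,
        "intervention_x_gender_x_gpa": t * g * gpa_z,
    }
```

With this coding, a girl in the intervention group whose GPA is 1 SD above the mean has a three-way product term of -1.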


Results 


Zero-order correlations and descriptive statistics for all variables 
are shown in Table 1, separately by adolescent’s gender. 


Multiple Regression Model of Direct Effects on 
Course-Taking 


To address the first research question, we regressed 12th-grade 
STEM courses taken on the base predictors and parents’ 
education.¹ For 12th-grade STEM courses taken, there was one 
significant effect: the three-way interaction among the intervention, 
adolescent’s gender, and ninth-grade STEM GPA (z = -2.44, p < 
.05, β = -.18).² In contrast to the main effect of the intervention 
reported by Harackiewicz et al. (2012), the pattern of the 
three-way interaction (see Figure 2) suggests that, when prior 
performance and gender are taken into consideration, the intervention 
increased course-taking for low-GPA boys (β = .27, p < .05) and 
high-GPA girls (β = .22, p < .10), whereas the intervention did 
not help low-GPA girls (β = -.20, trend toward a negative effect 
of the intervention) and had no effect on high-GPA boys 
(β = -.04). The graph of the three-way interaction in Figure 2, as 
for all interaction graphs in this paper, follows the convention of 
graphing high values at 1 SD above the mean of GPA and low 
values at 1 SD below the mean (Aiken & West, 1991). 
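
The subgroup effects reported above are conditional (simple) slopes of the intervention at particular values of gender and GPA. The sketch below shows how such a slope follows from the regression coefficients under the coding described in the Overview of Analyses. The coefficient values are hypothetical placeholders, not estimates from this study, and the function name is an assumption.

```python
def intervention_slope(coefs, gender, gpa_z):
    """Conditional effect of the intervention at a given gender and GPA.

    gender: +1 for boys, -1 for girls (effect coded).
    gpa_z: standardized GPA, e.g. +1 (high) or -1 (low).
    """
    return (
        coefs["intervention"]
        + coefs["intervention_x_gender"] * gender
        + coefs["intervention_x_gpa"] * gpa_z
        + coefs["intervention_x_gender_x_gpa"] * gender * gpa_z
    )

# Hypothetical coefficients for illustration only (not from the paper).
example = {
    "intervention": 0.1,
    "intervention_x_gender": 0.0,
    "intervention_x_gpa": 0.0,
    "intervention_x_gender_x_gpa": -0.2,
}

low_gpa_boys = intervention_slope(example, gender=1, gpa_z=-1)
high_gpa_girls = intervention_slope(example, gender=-1, gpa_z=1)
low_gpa_girls = intervention_slope(example, gender=-1, gpa_z=-1)
high_gpa_boys = intervention_slope(example, gender=1, gpa_z=1)
```

A negative three-way coefficient, as in this toy example, produces larger intervention slopes for low-GPA boys and high-GPA girls than for the other two subgroups. Because the intervention is coded 1 and -1, the predicted difference between groups is twice the slope.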


Structural Equation Model 


To address the second research question, we used structural 
equation modeling in Mplus to test whether the direct effect of the 
intervention (as moderated by gender and prior STEM perfor- 
mance) on 12th-grade STEM course-taking was mediated by in- 
direct effects through the mediators. In the model (see Figure 1), 
we estimated paths from the base predictors (the intervention, 
gender, prior STEM performance, and their interactions) to moth- 
ers’ STEM UV, perceptions of parents’ values, and adolescents’ 
future STEM value. To be consistent with previous analyses (Har- 
ackiewicz et al., 2012), we also included mothers’ years of edu- 
cation as a predictor of mothers’ STEM UV and of STEM course- 
taking. In accordance with the theoretical model, mothers’ STEM 


¹ The results remain the same if mothers’ education is substituted for 
parents’ education here. The three-way interaction is still the only 
significant predictor (z = -2.39, p < .05, β = -.18). 

² These regression analyses were repeated with STEM course-taking in 
11th and 12th grades as the outcome measure, the one used in the 
Harackiewicz et al. (2012) paper. The results were the same; that is, the three-way 
interaction among intervention, gender, and prior performance 
significantly predicted 11th- plus 12th-grade STEM course-taking. We report 
results in detail here only for the 12th-grade course-taking outcome, to 
preserve the temporal sequence for mediation analyses. 

Table 1 
Zero-Order Correlations and Descriptive Statistics for Major Variables by Gender 

Variable                             1             2             3             4             5             6             7
1. Ninth-grade STEM GPA              –             0.34**        0.36(?)       0.26(?)       0.26(?)       0.24*         0.18
2. Mothers’ STEM UV                  0.21          –             0.54**        0.39(?)       0.27(?)       0.30*         0.27(?)
3. Adolescents’ future STEM UV       0.40**        [illegible]   –             [illegible]   0.34          0.28(?)       0.15
4. Perceptions of parents’ values    [illegible]   0.39**        0.61**        –             0.15          0.25*         0.16
5. STEM courses (12th grade)         0.16          0.34(?)       0.36(?)       0.18          –             0.11          0.04
6. Parents’ education                0.42**        0.15          0.10          0.17          0.26(?)       –             0.79**
7. Mothers’ education                0.42**        [illegible]   0.26(?)       0.34(?)       0.21          0.86**        –
Girls, M (SD)                        3.15 (0.84)   4.08 (0.79)   5.23 (1.43)   5.75 (1.06)   3.27 (1.71)   15.35 (2.09)  15.41 (2.33)
Boys, M (SD)                         2.92 (0.88)   4.11 (0.81)   5.03 (1.63)   5.62 (1.26)   3.45 (1.85)   15.48 (1.76)  15.43 (1.88)

Note. Correlations above the diagonal are for boys. Correlations below the diagonal are for girls. There were no mean differences due to gender. STEM = 
science, technology, engineering, and mathematics; GPA = grade point average; UV = utility value. (?) = significance marker illegible in the source. 
*p < .05. **p < .01. 

UV was an additional predictor of perceptions of parents’ values 
and adolescents’ future STEM value. Furthermore, perception of 
parents’ values was a predictor of adolescents’ future STEM value. 
Additionally, paths were estimated from the base predictors, moth- 
ers’ STEM UV, and adolescents’ future STEM value to 12th-grade 
STEM courses taken. Thus, by examining the indirect effects of 
the base predictors through the mediators to STEM course-taking, 
this model tested whether the intervention, as moderated by GPA 
and adolescent’s gender, influenced STEM course-taking through 
mothers’ STEM UV, adolescents’ perceptions of parents’ values, 
and adolescents’ future STEM value. Because this is a saturated 
model, it does not allow for a meaningful test of model fit. 
Overall, the model accounted for 16.8% of the variance in 
12th-grade STEM course-taking, 13.9% of the variance in moth- 
ers’ STEM UV, 26.7% of the variance in perceptions of parents’ 
values, and 50.8% of the variance in adolescents’ future STEM 
value. See Figure 3 for the path models showing these results. 
Effects on mothers’ STEM UV. The base predictors and 
years of mothers’ education were used to predict mothers’ STEM 
UV. There was a nearly significant effect of ninth-grade STEM 
GPA (z = 1.94, p = .06, β = .17) showing a trend for mothers to 
perceive more STEM utility value when their adolescent had a 
higher ninth-grade STEM GPA. In addition, the predicted 
three-way interaction among the intervention, adolescent’s gender, and 
ninth-grade STEM GPA was significant (z = -1.96, p = .05, 
β = -.16); it is graphed in Panel A of Figure 4. The pattern of this 
interaction effect is similar to the one for the course-taking 
outcome in the multiple regression analysis (see Figure 2). Finally, 
mothers’ education was a significant predictor of mothers’ STEM 
UV (z = 2.32, p < .05, β = .20), such that mothers with more 
years of education showed higher levels of STEM UV.³ 

Figure 2. Direct effects of the intervention, adolescent’s gender, and 
ninth-grade STEM GPA on STEM course-taking (12th grade). The vertical 
axis shows semesters of 12th-grade STEM courses. Predicted 
values were generated for high (1 SD above the mean) and low (-1 SD) 
ninth-grade STEM GPA from the multiple regression models. Error bars 
represent ±1 SEM. STEM = science, technology, engineering, and 
mathematics; GPA = grade point average; SEM = standard error of the mean. 

Effects on perceptions of parents’ values. The base predic- 
tors and mothers’ STEM UV were used to predict adolescents’ 
perceptions of parents’ values. There was a significant effect of 
ninth-grade STEM GPA (z = 2.64, p < .05, β = .23), such that 
parents were perceived as seeing the value of STEM course-taking 
more when the adolescent had a higher STEM GPA. The two-way 
interaction between adolescent’s gender and the intervention was 
significant (z = 2.41, p < .05, β = .19), suggesting that the 
intervention increased boys’ perceptions of parents’ values and 
decreased girls’ perceptions of parents’ values; however, this 
two-way interaction was qualified by the three-way interaction among 
the intervention, adolescent’s gender, and ninth-grade STEM 
GPA, which was nearly significant (z = -1.89, p = .06, 
β = -.17). The pattern of the interaction is similar to the one for 
course-taking; in particular, the intervention appeared to decrease 
low-GPA girls’ perceptions of their parents’ values for them (see 
Figure 3, Panel A, and Figure 4, Panel B). That is, low-GPA girls 
in the intervention group perceived a lack of support for STEM 
from their parents. Finally, mothers’ STEM UV was a significant 
predictor of adolescents’ perceptions of parents’ values (z = 3.70, 
p < .01, β = .29), such that mothers with higher levels of STEM 
UV tended to have adolescents with higher levels of perceptions of 
parents’ values. 

Effects on adolescents’ future STEM value. The base 
predictors, mothers’ STEM UV, and perceptions of parents’ values 
were used to predict adolescents’ future STEM value (see Figure 
3). The three-way interaction among the intervention, adolescent’s 
gender, and ninth-grade STEM GPA was significant, as predicted 
(z = -2.85, p < .05, β = -.21). The three-way interaction is 
shown in Panel C of Figure 4; the pattern of the interaction is also 
similar to the one for STEM course-taking and suggests that the 
intervention increased adolescents’ future STEM value 
particularly for low-GPA boys. The effect of mothers’ STEM UV was 
significant (z = 4.35, p < .01, β = .29); higher levels of mothers’ 
STEM UV predicted higher levels of adolescents’ future STEM 
value. Perceptions of parents’ values was a significant predictor as 
well (z = 5.36, p < .01, β = .37); higher levels of perceptions of 
parents’ values predicted higher levels of adolescents’ future 
STEM value. 

³ The model was also tested using parents’ education instead of mothers’ 
education, and the results for the overall model did not change; however, 
parents’ education was a nonsignificant predictor of mothers’ STEM UV 
(z = 1.77, p > .05, β = .15). 

[Path diagram not reproduced. Panel A: Intervention Effects for Girls. 
Panel B: Intervention Effects for Boys. Nodes: Mothers’ Education; 
Utility Value Intervention; Perceptions of Parents’ Values; Adolescents’ 
Future STEM Value; 12th-Grade STEM Course-Taking.] 
Figure 3. Empirical path model. Only significant paths are shown. The effect of the intervention on STEM 
course-taking in 12th grade differs by gender and ninth-grade STEM GPA and is mediated by mother’s utility 
value (UV), perceptions of parents’ values, and adolescents’ future STEM value. The different intervention 
effects are shown (A) for girls and (B) for boys. Dashed line indicates p = .06 (the three-way interaction 
involving the intervention to perceptions of parents’ values). STEM = science, technology, engineering, and 
mathematics; GPA = grade point average. 

Figure 4. Direct effects of the intervention, child gender, and ninth-grade 
STEM GPA on the hypothesized mediators: (A) mothers’ STEM UV, (B) 
perceptions of parents’ values, and (C) adolescents’ STEM utility value. 
Predicted values were generated for high (1 SD above the mean) and low 
(-1 SD) ninth-grade STEM GPA from the multiple regression models. For 
(A) the range of possible values is 1 to 5. For (B and C) the range of 
possible values is 1 to 7. Error bars represent ±1 SEM. STEM = science, 
technology, engineering, and mathematics; GPA = grade point average; 
UV = utility value; SEM = standard error of the mean. 

Effects on 12th-grade STEM course-taking. The base pre- 
dictors, mothers’ STEM UV, adolescents’ future STEM value, and 
mothers’ years of education were used to predict 12th-grade 
STEM course-taking.⁴ There was a significant effect of 
adolescents’ future STEM UV (z = 2.18, p < .05, β = .22). Higher 
levels of adolescents’ future STEM value predicted more 12th- 
grade STEM courses taken, for both boys and girls. 

Indirect effects and mediation. In the structural equation 
model, we hypothesized that the base predictors (specifically the 
three-way interaction) would influence 12th-grade STEM course- 
taking through the mediators, so the direct effects of the base 
predictors to 12th-grade STEM course-taking shown in the direct 
effects model should be reduced in a model containing the medi- 
ators; additionally, there should be significant indirect effects of 
the three-way interaction through the mediators to 12th-grade 
STEM course-taking. We hypothesized that the mediation would 
work in a specific way; that is, the three-way interaction should 
predict mothers’ STEM UV, perceptions of parents’ values, and 
adolescents’ future STEM UV. Mothers’ STEM UV should predict 
perceptions of parents’ values and adolescents’ future STEM UV, 
and perceptions of parents’ values should predict adolescents’ 
future STEM UV. Additionally, we specified that mothers’ STEM 
UV and adolescents’ future STEM value would predict STEM 
course-taking. Using procedures described by Preacher and Hayes 
(2008), we tested the total indirect effect of the intervention 
through the three mediators, as well as the specific indirect effect 
of mothers’ STEM UV through adolescents’ perceptions of par- 
ents’ values and adolescents’ future STEM UV. 

Therefore, two indirect pathways were tested in order to test for 
the indirect effect of the three-way interaction on 12th-grade 
STEM course-taking as well as the indirect effect of mothers’ 
STEM UV on 12th-grade STEM course-taking. For the first, we 
tested whether the three-way interaction had a significant total 
indirect effect on 12th-grade STEM course-taking through the 
three mediators and found support for this hypothesis (z = -2.40, 
p < .05). Therefore, the intervention, as moderated by adolescent’s 
gender and ninth-grade STEM GPA, had a significant total indirect 
effect on course-taking through the mediating variables: mothers’ 
STEM UV, perceptions of parents’ values, and adolescents’ future 
STEM value. Additionally, the model with the mediators reduced 
the direct effects of the predicted three-way interaction on 12th- 
grade STEM course-taking (direct effect, β = -0.18, p < .05; 
with mediators in the model, β = -0.09, ns). 

For the second, we tested for the specific indirect effect of 
mothers’ STEM UV to 12th-grade STEM course-taking through 
perceptions of parents’ values and adolescents’ future STEM 
value. Results indicated a significant specific indirect effect (z = 
2.06, p < .05). This indicated that mothers’ STEM UV had a 
significant specific indirect effect on 12th-grade STEM course- 
taking through perceptions of parents’ values and adolescents’ 
future STEM value. 
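
In a path model, a specific indirect effect is the product of the coefficients along one route from predictor to outcome. The sketch below applies that product rule to the serial route from mothers’ STEM UV through perceptions of parents’ values and adolescents’ future STEM value to course-taking, using the standardized coefficients reported above. It is illustrative only: the paper reports z tests for the indirect effects rather than these products, the function name is hypothetical, and a significance test would also need standard errors that are not given here.

```python
from math import prod

def indirect_effect(path_coefficients):
    """Product-of-coefficients estimate for one indirect route."""
    return prod(path_coefficients)

# Standardized coefficients reported in the text.
uv_to_perceptions = 0.29      # mothers' STEM UV -> perceptions of parents' values
perceptions_to_value = 0.37   # perceptions -> adolescents' future STEM value
value_to_courses = 0.22       # future STEM value -> 12th-grade course-taking

# Assumed route for illustration: UV -> perceptions -> value -> course-taking.
serial_route = indirect_effect(
    [uv_to_perceptions, perceptions_to_value, value_to_courses]
)
```

Because each coefficient is well below 1 in absolute value, the product shrinks with every additional link, which is why indirect effects are typically much smaller than the individual paths.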


Discussion 


To address concerns about low rates of adolescents taking 
advanced STEM courses in high school in the United States, we 
implemented an intervention, based in expectancy-value theory, 
with parents of adolescents (Harackiewicz et al., 2012). In the 
results reported here, we examined whether the intervention was 
differentially effective for girls compared with boys in the context 


* The analyses were repeated using parents’ education instead of moth- 
ers’ education. Findings remained unchanged. 
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of past performance and what factors mediated the effects of the 
intervention on course-taking. In response to the first research 
question, the results from multiple regression analysis indicated 
that the intervention increased STEM course-taking in 12th grade 
for girls who had done well in ninth-grade STEM courses (high 
GPA) and for boys who had not done well (low GPA). However, 
the intervention did not increase course-taking for low-GPA girls 
(trending toward a negative effect), and it had no effect for high- 
GPA boys. The absence of an effect for high-GPA boys is most 
likely due to a ceiling effect on the measure of number of STEM 
courses taken in 12th grade. 

In regard to the second research question, mediation analyses 
suggested that these intervention effects (specifically the three- 
way interaction among the intervention, gender, and prior STEM 
performance) occurred through changes in both mother and ado- 
lescent variables. The intervention was targeted exclusively at 
parents, so we predicted and found that the intervention increased 
mothers’ STEM utility value for their adolescents. The interven- 
tion also led adolescents to perceive higher levels of parental 
STEM values and increased adolescents’ future STEM value, and 
the changes in mothers’ STEM utility value contributed to these 
changes in adolescent variables. Overall, the effect of the inter- 
vention on high-school STEM course-taking was mediated by the 
effects of the intervention on mothers’ STEM utility value and 
adolescents’ STEM utility value. This suggests that parents’ utility 
value does indeed influence adolescents’ utility value and achieve- 
ment behavior. 

Considerable support for Eccles’s expectancy-value theory has 
been amassed through correlational and longitudinal research, but 
experimental support has been lacking. One strength of an exper- 
imental approach to this theory is that researchers can assess the 
causal effect of task values on achievement motivation and behav- 
ior. In particular, when studying families, an association has been 
shown between parents’ beliefs and their children’s beliefs and 
achievement-related behaviors (e.g., Chhin, Bleeker, & Jacobs, 
2008), but the direction of the effect has been unclear. To explore 
whether parents’ values could influence adolescents’ values, we 
experimentally manipulated parents’ utility value through a ran- 
domized intervention to assess the causal impact of parents’ beliefs 
on their children’s beliefs and behaviors (Harackiewicz et al., 
2012). Although the original study showed that an increase in 
adolescents’ STEM course-taking over the final 2 years of high 
school occurred as a result of this intervention, mediation analyses 
of this effect were not conducted. 


Processes Underlying Intervention Effects 


In the current paper, we examined the hypothesis that the 
intervention worked by changing parents’ and adolescents’ STEM 
utility value. We found support for this hypothesis. In our previous 
paper (Harackiewicz et al., 2012), the results indicated that the 
intervention affected mothers’ STEM utility value, which provides 
crucial support that this utility value intervention for parents had 
its intended effect. In the current analyses, this increase in moth- 
ers’ STEM utility value was related to an increase in adolescents’ 
perceptions of how much their parents valued STEM for them and 
also adolescents’ future STEM value. Thus, both mothers and 
adolescents had increased perceptions of STEM value due to the 
intervention. Because the intervention was targeted exclusively at 
parents, it is reasonable to conclude that adolescents were 
influenced by their parents. 

Two paths in Figure 3 warrant additional discussion. First, the 
direct path (specifically the three-way interaction among the in- 
tervention, gender, and prior STEM performance) from the inter- 
vention to adolescents’ perceptions of their parents’ values was 
significant, above and beyond the indirect path through mothers’ 
STEM utility value. That is, the intervention appeared to have 
some effect on adolescents’ perceptions beyond the effect it had on 
mothers’ STEM UV for them. This might involve a process such 
as a mother sharing the intervention website with her adolescent 
while not expressing her beliefs in the value of STEM. Second, the 
direct path from mothers’ STEM UV to adolescents’ future STEM 
value was significant, beyond the indirect effect through adoles- 
cents’ perceptions of their parents’ values. This effect might in- 
volve some changes in mothers’ behavior that are not consciously 
perceived by the adolescent but that nonetheless have an effect. 


Moderation by Gender and Prior Performance 


In this paper we also considered whether the intervention, which 
had an overall positive main effect on course-taking, might be 
differentially effective based on the adolescent’s gender and prior 
STEM performance. The results indicated that, in fact, adoles- 
cents’ prior STEM grades moderated the effect of the utility value 
intervention differently for girls and boys. The intervention had 
positive effects on STEM course-taking for low-GPA boys and 
high-GPA girls, but it had no effect (trending toward a negative 
effect) for low-GPA girls and had no effect for high-GPA boys. 

Why were low-GPA girls not helped by the intervention when 
low-GPA boys were helped by it? The measure of prior perfor- 
mance, ninth-grade STEM GPA, should be linked tightly to both 
mothers’ and adolescents’ expectations for future success in 
STEM and has been used as a proxy for expectations in previous 
utility intervention research (Hulleman et al., 2010). Yet, research 
shows that parents are more likely to have inflated expectancies for 
success for boys in this domain in comparison to girls (Eccles, 
Jacobs, & Harold, 1990; Gunderson, Ramirez, Levine, & Beilock, 
2012; Jacobs, Davis-Kean, Bleeker, Eccles, & Malanchuk, 2005; 
Yee & Eccles, 1988). Thus, parents may assess all boys as capable 
of success in STEM, even if they have had low grades in school. 
Therefore, even low-GPA boys may benefit from a utility-value 
intervention targeted at parents, because parents will still deem 
them capable of succeeding. Boys with higher prior STEM 
achievement did not benefit from the intervention, probably due to 
a ceiling effect in the number of semesters of mathematics and 
science taken during 12th grade. That is, their STEM course-taking 
was constrained by factors such as the number of class periods in 
the day and requirements that they take non-STEM courses. Pos- 
itive effects of the intervention for high-GPA boys might be 
revealed in situations with fewer constraints (e.g., in college). 

For girls, low STEM GPA may create low expectations for 
success—both for the girl and her mother—that negate the bene- 
ficial effects of the UV intervention; even if parents see the value 
of STEM, their low expectations for success for their low-GPA 
daughters mean that parents have low STEM aspirations for them, 
rendering the utility value of STEM irrelevant. These effects are 
consistent with the predicted effects in Eccles’s expectancy-value 
theory. Moreover, they are consistent with past research showing 
that UV interventions are less effective for those with low 
expectations for success (e.g., Durik & Harackiewicz, 2007). 

In addition, girls and their mothers observe the unbalanced 
gender composition of many adult occupations (Ridgeway, 2011), 
which may contribute to the findings. Whereas girls with a high 
STEM GPA may aspire to traditionally masculine careers requir- 
ing substantial mathematics and science and be responsive to the 
intervention, girls with a low STEM GPA may see no reason to 
consider such aspirations and, simultaneously, may be drawn to 
traditionally feminine careers such as child-care worker (95% 
female, Bureau of Labor Statistics, 2011) or elementary- or 
middle-school teacher (82% female), which appear to require little 
mathematics and science (Beilock, Gunderson, Ramirez, & 
Levine, 2010). Moreover, if parents share these beliefs, they may 
not encourage their daughters to pursue STEM careers. This in- 
terpretation is supported by the relatively low level of parental 
valuing of STEM that low-GPA girls reported (see Figure 4, Panel 
B). On balance, then, girls with low prior STEM performance may 
have little interest in STEM courses and careers and receive little 
encouragement from parents, despite the intervention, while simul- 
taneously experiencing a strong pull toward traditionally female 
careers that appear to require little mathematics and science and 
where they feel that they “belong” (Thoman et al., 2013). 

It will be important for future interventions to take into account 
the role of expectancies in designing utility-value interventions 
that will be successful for all students. Recent research has shown 
that, although the interactive effects of expectancy and value are 
mixed, this interaction does occur in some studies (Nagengast et 
al., 2011; Trautwein et al., 2012). This intervention was in the 
STEM domain, so it is likely that both parents’ and adolescents’ 
expectancies would be affected not only by prior achievement but 
also by the adolescent’s gender. Future interventions may be 
strengthened by the inclusion of information that enhances not 
only perceptions of utility value but also expectations for success. 


Limitations and Directions for Future Research 


Several limitations should be kept in mind when interpreting 
these results. First, the sample was representative of the state of 
Wisconsin but not racially diverse, so future research should 
extend these findings to more diverse groups. Previous studies 
have shown that the effects of utility-value interventions are con- 
sistent across racial groups (Hulleman & Harackiewicz, 2009), 
suggesting that our results would extend to more diverse contexts. 
Additionally, although the sample size was sufficiently large to 
have the power to detect the intervention effects, future studies 
would benefit from scaling up the intervention to larger samples. 

Second, although the utility-value intervention affected moth- 
ers’ and adolescents’ perceptions of utility value and adolescents’ 
course-taking behavior, we do not have measures of the precise 
interpersonal processes by which these increases in mothers’ util- 
ity value changed adolescents’ attitudes. Correlational research has 
shown that these effects may be explained through a variety of 
parental behaviors, such as modeling, encouragement, and coact- 
ivity (e.g., Simpkins et al., 2012). Future studies could also assess 
these behaviors to understand how parents’ perceptions of utility 
value result in behavioral change that affects their children. It is 
likely that parents use a variety of methods and behaviors to 
influence their children, so understanding which behaviors are 
most effective will make an important contribution to future 
research. We believe that future studies may also benefit from using 
measures of adolescents’ perceptions of their parents’ values as we 
did here, because that measure can capture the effect of a variety 
of parental behaviors. 

Third, this utility-value intervention (and much of the correla- 
tional research based on expectancy-value theory) was conducted 
within a specific domain, STEM. Therefore, we cannot assume 
that these intervention results would generalize to non-STEM 
domains, and future research should extend these findings to other 
domains. Previous research has shown that the relationships be- 
tween utility value and achievement behavior do extend to non- 
STEM domains (e.g., Jodl et al., 2001), so the intervention effects 
should also generalize, but this will need to be tested in future 
studies. 

Finally, although the utility-value intervention had effects that 
differed due to gender and prior achievement, it is important to 
recognize that, on average, this intervention had substantial posi- 
tive effects on STEM course-taking (Harackiewicz et al., 2012). 
Future studies may modify this intervention to make it more 
effective, but it had generally positive effects on a key educational 
outcome needed to enhance STEM preparation. Therefore, we can 
recommend this intervention as having positive effects and also 
recommend taking into account expectancies for success to make 
it more effective in future research. 


Implications 


Several implications flow from these results. The findings indi- 
cate that parents are a resource—a largely untapped one—that may 
be used to enhance STEM motivation of adolescents. There is 
room to increase how much parents value STEM for their adoles- 
cents, and changes in parents’ utility value can affect adolescents’ 
beliefs and behavior. Therefore, parents—in addition to teachers 
and curriculum—may be used to increase students’ STEM prep- 
aration and motivation. Future utility-value interventions should 
also attend to issues of expectations for success, particularly in 
regard to gender gaps in STEM. 
Prekindergarten Children’s Executive Functioning Skills and Achievement 
Gains: The Utility of Direct Assessments and Teacher Ratings 
An accumulating body of evidence suggests that young children who exhibit greater executive functioning 
(EF) skills in early childhood also achieve more academically. The goal of the present study was to 
examine the unique contributions of direct assessments and teacher ratings of children’s EF skills at the 
beginning of prekindergarten (pre-k) to gains in academic achievement over the pre-k year. Data for the 
current study come from a subsample of children recruited for a large-scale pre-k curriculum interven- 
tion. This subsample (n = 719) was restricted to all children who were native English speakers and had 
at least 1 pretest and posttest score on the assessments. Several important findings emerged. Teacher 
reports of EF and direct assessments were correlated, particularly when EF direct assessments were 
modeled as a single component score. When entered into the models simultaneously, both teacher ratings 
and direct assessments significantly predicted academic gains in literacy and mathematics; however, the 
direct assessments were only marginal in predicting gains in language. EF skills accounted for the largest 
proportion of variance in mathematics achievement gains. The value of using both types of measures in 
future research is discussed. 
An accumulating body of evidence suggests that young children 
who exhibit greater self-regulation abilities in early childhood 
achieve more academically (e.g., Blair & Razza, 2007; Bodovski 
& Farkas, 2007; Duncan et al., 2007; Li-Grining, Votruba-Drzal, 
Maldonado-Carreño, & Haas, 2010), have lower rates of hyperactive 
and disruptive behaviors (e.g., Espy, Sheffield, Wiebe, Clark, 
& Moehr, 2011; Séguin, Nagin, Assaad, & Tremblay, 2004), and 
are less likely to commit crimes and engage in delinquent behavior 
as adolescents or adults (Moffitt et al., 2011). Within the group of 
studies cited above are ones that capitalized on global ratings of 
children’s self-regulation, including those asking parents and 
teachers to rate children’s self-control, impulsivity, emotion reg- 
ulation, persistence, and attention (e.g., Duncan et al., 2007; Mof- 
fitt et al., 2011). Other researchers, like Blair and Razza (2007), 
have focused on direct child assessments to capture specific 
elements of children’s self-regulation as they relate to school 
readiness and academic achievement. 

In the current study, we focus on a set of skills within the 
domain of self-regulation that is typically referred to as executive 
functioning (EF) or cognitive control skills, including areas such as 
working memory, inhibitory control, and attention flexibility, and 
the contributions that both teacher reports and direct assessments 
of EF make to academic achievement. We examined the associa- 
tions between children’s EF skills and learning in prekindergarten 
(pre-k). Early childhood is a time when not only are children’s EF 
skills showing rapid improvement (see Carlson, 2005; Garon, 
Bryson, & Smith, 2008), but also children are beginning to gain 
exposure to early academic concepts. Hence, this is a key devel- 
opmental period in which to examine more closely the longitudinal 
associations between children’s EF and early academic skills, 
providing findings that could inform assessment and intervention 
efforts in classrooms for young children. 


Executive Functioning and Academic Achievement 


EF skills include working memory (the ability to keep informa- 
tion active in memory and manipulate it), inhibitory control (the 
ability to inhibit salient but irrelevant information in favor of 
relevant information), and attention flexibility (the ability to shift 
and persist in attention; Garon et al., 2008). One or more of these 
EF skills have been positively associated with higher academic 
performance in young children in a variety of studies (e.g., Blair & 
Razza, 2007; Bodovski & Farkas, 2007; Duncan et al., 2007; Fuhs, 
Nesbitt, Farran, & Dong, 2014; Li-Grining et al., 2010). It has been 
hypothesized that stronger EF skills may allow children to meet 
the demands of an early childhood classroom better by facilitating 
attention, memory for class rules, and engagement in academic 
content, all of which may allow them to benefit from an academic 
environment. EF skills may also work in a more direct way by 
aiding children’s memory for salient information in early mathe- 
matics problem solving or increasing their flexibility in maintain- 
ing both letter sounds and symbols in memory during early literacy 
activities. 

Much of the previous literature examining associations between 
EF skills and academic achievement in early childhood does not 
directly address the extent to which different methodologies for 
assessing EF skills may address the same construct and will relate 
to achievement growth. A growing literature focuses on using one 
or a battery of direct assessments to assess EF, whereas another 
line of research has addressed the associations of teacher reports of 
EF and academic skills development. A few studies have included 
both methodologies. In the following paragraphs, we address prior 
research on each methodology as well as when they have been 
used together before addressing gaps in the literature concerning 
the relative contribution each method makes to our understanding 
of the association between EF and growth in academic skills in 
young children. 


Teacher Ratings of Executive Functioning Skills 


Self-regulation typically serves as an umbrella term that in- 
cludes cognitive and emotional components (Raver et al., 2012), 
associated with concepts in the teacher report literature such as “ap- 
proaches to learning” and “self-control.” Much of what we know 
about associations between self-regulation and academic achievement 
over time and in older children has derived from the use of these more 
global teacher report measures. For example, in the Early Childhood 
Longitudinal Study—Kindergarten, teachers rated children’s ap- 
proaches to learning, which included such behaviors as persistence at 
tasks, eagerness to learn, attention, learning independence, flexibility, 
and organization. Teacher ratings of these characteristics predicted 
mathematics achievement at all grade levels from kindergarten to 
second grade with the strongest effects for those children whose 
achievement fell in the bottom quartile (Bodovski & Farkas, 
2007). 

Various assessments have been developed to capture self- 
regulation in young children through parent and/or teacher reports. 
For example, the Child Behavior Questionnaire (CBQ; Rothbart, 
Ahadi, Hershey, & Fisher, 2001) is considered a temperament 
scale. Based on Likert-type ratings of young children’s emotional 
and cognitive regulation, the CBQ includes an inhibitory control 
subscale. This measure is typically defined as an assessment of 
effortful control, which includes both cognitive and emotion reg- 
ulation components. Blair and Razza (2007) found that preschool 
teacher reports of effortful control using the CBQ were related to 
children’s kindergarten mathematics and literacy skills. The coef- 
ficients were much weaker for the teacher reports when they were 
compared to direct child assessments of EF in preschool. The CBQ 
includes items reflecting both emotion regulation and cognitive 
regulation, making it difficult to compare directly the contributions 
of teacher reports of cognitive regulation alone and direct child 
assessments of targeted EF skills. In addition, Blair and Razza had 
no preschool measures of achievement; this intriguing study of the 
associations among these areas does not help us understand the 
relationship between EF (assessed in different ways) and learning 
as measured by gains in academic skills across time. 

While the CBQ includes both emotion and cognitive regulation, 
other teacher rating measures focus more specifically on EF skills. 
A clinically oriented measure, the Behavior Rating Inventory of 
Executive Function—Preschool Version (BRIEF-P; Gioia, Isquith, 
Retzlaff, & Espy, 2002), assesses whether children have deficits in 
particular areas of EF. The BRIEF-P has most commonly been 
used as a clinical and neuropsychological assessment of executive 
dysfunction. Other measures such as the Child Behavior Rating 
Scale (CBRS; Bronson, Tivnan, & Seppanen, 1995) focus more 
specifically on EF skills as they are manifested in typical class- 
room behavior in early childhood. Recent work has linked teacher 
ratings on the CBRS to children’s early academic skills develop- 
ment above and beyond a direct child assessment of behavioral 
regulation (Wanless, McClelland, Acock, Chen, & Chen, 2011). 
Wanless et al. (2011) focused on the links between EF and 
achievement across different cultures. The timing of the teacher 
reports differed such that teachers in the United States did not 
complete teacher ratings until the middle of the school year. Thus, 
it is unclear the extent to which the differing timing of the teacher 
and direct assessments affected their contributions to academic 
achievement. The CBRS was also used in a kindergarten study that 
included a direct assessment of EF (Head, Toes, Knees, and 
Shoulders [HTKS]) as a predictor of achievement (Matthews, 
Ponitz, & Morrison, 2009); this sample was primarily middle- 
income and white. While achievement and HTKS were measured 
at pre- and posttest, teacher CBRS ratings were only collected in 
the spring and only for about 60% of the sample. Nevertheless, on 
this subset of children, both spring teacher ratings and fall HTKS 
scores were related to children’s gains in math achievement over 
the year even when both were in the model. 

A differently constructed measure of EF in the classroom is the 
Work-Related Skills subscale of the Cooper-Farran Behavioral 
Rating Scale (WRS; Cooper & Farran, 1988, 1991). The CBQ, the 
BRIEF-P, and the CBRS are similar to each other in that the items 
in each are rated on a Likert-type scale from, for example, “ex- 
tremely untrue” to “extremely true” or “often” to “never.” The 
WRS is different in that it assesses children’s EF skills in academic 
learning contexts through the use of behaviorally anchored items 
related to classroom expectations. For example, one item lists 
“Listening to Teacher Giving Instructions to Group,” and the 
anchors are “Attends to the teacher without reminders,” “Occa- 
sionally inattentive; attention is easily regained by a cue from 
teacher,” “Can maintain attending behavior with frequent remind- 
ers from the teacher,” and “Seems to ignore the teacher; is very 
distracted and distracting.” This type of scale, in contrast to a 
Likert rating scale, is situationally specific. The “person-situation” 
debate is a robust one in psychology; a trait approach is useful for 
predicting behaviors “averaged over many situations, occasions, 
and responses” (Epstein & O’Brien, 1985, p. 532). In a classroom, 
however, rating scales can suffer from what has been called a 
“reference group” problem (Heine, Lehman, Peng, & Greenholtz, 
2002). While most clearly evident in cross-cultural work, Heine et 
al. (2002) argued that the reference group issue applies for any 
groups that might possibly possess different referents for their 
ratings, such as teachers. One solution Heine et al. proposed, 
although not without limitations, is to create items with concrete, 
objective, response options such as behavioral anchors. 




Several studies of classrooms in the United States have demon- 
strated an association between teacher reports of young children’s 
EF skills assessed by the WRS and their academic achievement 
(e.g., McClelland, Acock, & Morrison, 2006; McClelland, Morri- 
son, & Holmes, 2000; Speece & Cooper, 1990), but many have 
focused on the relationship in kindergarten. For example, McClel- 
land et al. (2006) found that children’s EF skills, as assessed by the 
WRS subscale, predicted their academic achievement across do- 
mains in kindergarten and continued to predict mathematics and 
literacy achievement out to second grade after controlling for 
covariates, including prior achievement. McClelland et al. (2006) 
also found associations between ratings of EF skills and academic 
achievement out to sixth grade. 

Although the work summarized above points to an association 
between teacher ratings of children’s classroom-related EF and 
academic achievement, at least two specific questions remain. 
First, are these associations apparent prior to kindergarten? Chil- 
dren are increasingly likely to experience academic instruction in 
pre-k classrooms, especially those associated with public schools 
and a learning agenda. Because the previous research with a more 
classroom-specific scale has explored these links primarily in 
kindergarten, it is important to investigate whether teacher reports 
of children’s EF in pre-k classrooms will also capture children’s 
EF skills as they relate to their academic growth. Second, do these 
associations hold even when accounting for children’s perfor- 
mance on direct assessments of EF? Concern is often raised about 
bias in teacher ratings primarily relating to teacher judgments of 
behavior and learning problems (e.g., Berg-Nielsen, Solheim, Bel- 
sky, & Wichstrom, 2012; Mullola et al., 2012). The assumption is 
that these ratings will be less valid and reliable than direct assess- 
ments of behavior. As direct child measures of EF have been 
developed and more information has accrued on their reliability 
and validity, an issue is whether they could substitute for teacher 
ratings and present equal, or perhaps better, indications of chil- 
dren’s learning-related characteristics. 


Direct Child Assessments of Executive Function 


As previously mentioned, one increasingly common approach to 
measuring EF skills in young children is to assess them directly 
with a battery of tasks to tap working memory, inhibitory control, 
and attention flexibility. For the most part, these tasks were de- 
veloped first in psychological laboratories. Researchers have found 
concurrent associations between directly assessed EF skills and 
children’s academic achievement in both literacy and mathematics 
across different grade levels (e.g., Allan & Lonigan, 2011; Best, 
Miller, & Naglieri, 2011; Bull, Espy, Wiebe, Sheffield, & Nelson, 
2011; Bull & Scerif, 2001; St. Clair-Thompson & Gathercole, 
2006). Bull and Scerif (2001) found that several direct assessments 
of pre-k children’s EF were concurrently related to their mathe- 
matics skills (see also Bull et al., 2011). In a cross-sectional study, 
Best et al. (2011) found contemporaneous associations between 
individually assessed EF measures and achievement at each grade 
level from age 5 through high school with the strongest, most 
consistent relationship being with mathematics achievement. Fi- 
nally, longitudinal research suggests modest correlations between 
pre-k children’s EF skills and their growth in academic skills as 
well (e.g., Bull, Espy, & Wiebe, 2008; Clark, Pritchard, & Wood- 
ward, 2010; Fuhs et al., 2014; McClelland et al., 2007). 


Gaps in Extant Literature 


While the literature indicates that both direct assessments and 
teacher ratings of children’s EF skills are positively associated 
with children’s academic achievement, several questions remain. 
First, is there a significant association between teacher reports of 
EF and direct child assessments in a pre-k sample? McClelland et 
al. (2007) examined correlations between HTKS as a direct as- 
sessment of EF and teacher ratings of children’s social and behav- 
ioral regulation, but this work has not been extended to the WRS 
and a broader range of EF direct child assessments. 

Second, what are the benefits and unique contributions of these 
two methods of assessment? A recent review of performance- 
based measures and ratings of EF found only modest correlations 
between them when each was used in the same study (Toplak, 
West, & Stanovich, 2013). The authors concluded that the two 
types of measures were actually tapping different cognitive levels 
in the respondent. Direct assessments, they argued, provide 
evidence of the individual’s available processes, while ratings provide 
evidence of how those processes may or may not be used in an 
actual setting. Also, direct assessments provide an understanding 
of children’s EF skills in a controlled or neutral context because 
they are administered to children individually usually in a quiet 
space. It is not clear what relationship performance in the con- 
trolled setting will have with children’s EF skills in an ecological 
context like a classroom. On the other hand, while teacher ratings 
of children’s behaviors can be particularly beneficial to understand 
children’s EF skills in authentic classroom environments, ratings 
could be influenced by other aspects of children’s abilities and 
skills besides EF. 

Using both types of assessments together in a multimethod 
approach could yield a more complete understanding of children’s 
EF skills, “providing important and nonredundant information 
about an individual’s efficiency and success in achieving goals” 
(Toplak et al., 2013, p. 138). Moreover, an understanding of the 
relation between the two methods of assessing EF is informative 
for research because (a) utilizing direct child assessments is not 
always a feasible option for researchers, and (b) teacher reports 
may not be appropriate as the only means of assessing children’s 
abilities (e.g., in curriculum interventions that focus specifically on 
developing self-regulation). 


Current Study 


The goal of the present study is to examine the contribution of 
direct assessments and teacher ratings of children’s EF skills at the 
beginning of pre-k to predict children’s gains in academic achieve- 
ment over the pre-k year. Four research questions were examined: 
(a) Is children’s performance on direct assessments of EF posi- 
tively correlated with teachers’ ratings of their EF skills in the 
classroom context? (b) Are teacher ratings of children’s EF skills 
at the beginning of pre-k associated with the gains children make 
in literacy, language, and mathematics across the pre-k year? (c) Is 
children’s performance on direct assessments of EF at the begin- 
ning of pre-k associated with the gains they make in literacy, 
language, and mathematics across the pre-k year? (d) Are direct 
assessments and teacher ratings of EF, when examined together in 
a single model, significantly and uniquely associated with chil- 
dren’s literacy, language, and mathematics gains? 
Method 


Participants 


Data for the current study are a subsample from a larger sample 
of children (N = 1,145) recruited for a large-scale randomized 
control trial (RCT) to evaluate the effectiveness of the Tools of the 
Mind curriculum (Bodrova & Leong, 2007; Farran, Wilson, & 
Lipsey, 2013). All assessments were administered in English; 
therefore, we removed nonnative English speakers (n = 380) from 
the current sample to eliminate confounds due to limited English 
proficiency. To be consistent with the analytic sample of the RCT, 
we also removed children who did not have at least one pre- or 
posttest (primarily due to moving prior to the spring assessments). 
The analytic sample for the current study, therefore, consisted of 
719 English-speaking pre-k students who had at least one complete 
pretest and posttest measure. There were 695 children (M_age = 54 
months; SD_age = 4 months) with complete demographic, child 
assessment, and teacher report data for this study. Children with 
missing data points did not differ from the analytic sample on any 
demographic or pretest measure (p > .05), and because the cases 
with missing data constituted less than 5% of the analytic sample, 
we only used available data for each analysis rather than conduct- 
ing multiple imputation. Girls were 46% of the sample, and chil- 
dren came from varied racial/ethnic backgrounds (36% Black, 
52% White, 5% Hispanic, and 7% other).¹ Sixteen percent of the 
sample had an individualized education program (IEP), which was 
included in analyses as a covariate. Although precise socioeco- 
nomic status (SES) information was not available due to Family 
Educational Rights and Privacy Act regulations, all children in this 
study came from public pre-k programs targeted to low-income 
families. Therefore, it can be assumed that most, if not all, children 
in the study were from low-income backgrounds. 
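
The sample counts reported above can be checked with a few lines of arithmetic. The sketch below assumes the exclusions were applied in the order described (nonnative English speakers first, then children lacking a pretest or posttest); the number of children dropped at the second step is not stated in the text and is derived here under that assumption. The function name `sample_flow` is hypothetical.

```python
def sample_flow(recruited, nonnative, analytic, complete):
    """Derive intermediate counts from the reported sample sizes."""
    english_speakers = recruited - nonnative
    missing = analytic - complete
    return {
        "english_speakers": english_speakers,
        # Derived, not reported: children lacking a pretest or posttest.
        "dropped_no_pre_or_post": english_speakers - analytic,
        # Children in the analytic sample with at least one missing data point.
        "missing_some_data": missing,
        "missing_share": missing / analytic,
    }


flow = sample_flow(recruited=1145, nonnative=380, analytic=719, complete=695)
for name, value in flow.items():
    print(name, value)
```

Under these assumptions, 24 of the 719 children (about 3.3%) had some missing data, which is consistent with the statement that such cases constituted less than 5% of the analytic sample.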

Children in the sample were nested in 80 classrooms in 57 
schools in six school systems in the Southeastern United States. On 
average, nine children from each classroom were in the analysis 
sample. The average number of years of experience teaching pre-k 
for the teachers in the study was six. Because these data were 
drawn from an RCT, 32 classrooms were assigned to the Tools of 
the Mind condition and 28 classrooms were assigned to “business 
as usual,” which involved a variety of curricula, but primarily 
Creative Curriculum, Opening the World of Learning, and Build- 
ing Blocks. All analyses of main effects for curriculum on indi- 
vidual academic achievement measures and possible interactions 
between curriculum and demographic characteristics or children’s 
pretests were nonsignificant in the RCT (Farran et al., 2013). 
Nonetheless, we included condition as a control variable in all of 
the present analyses as the experimental curriculum could poten- 
tially account for variance in academic achievement and EF in this 
sample. 


Measures 


Teacher reports. Children’s classroom-specific EF skills as 
well as their more general social skills were assessed using the 
Cooper-Farran Behavioral Rating Scale (CFBRS; Cooper & Farran,
1988, 1991). The CFBRS is an assessment of young children’s
behaviors at school entry and consists of two subscales:
Work-Related Skills (WRS) and Interpersonal Skills (IPS). The
WRS subscale rates children’s EF skills as they are manifested in
the classroom. The WRS consists of 16 items related to children’s 
independent work, compliance and memory for instructions, and 
ability to complete tasks. The IPS subscale rates children’s social 
skills. The IPS consists of 21 items related to children’s ability to 
engage effectively in interactions with peers and teachers. All 
CFBRS items are rated from 1 to 7 using behavioral anchors
distinctive to each odd-numbered item. As reported in the manual 
(Cooper & Farran, 1991), the test-retest reliabilities after an 8-week
delay for the WRS and IPS subscales were .66 and .69, respectively.
Interrater reliability on the measure has also been established
between teachers and teacher aides; intraclass correlations
between the raters were .79 and .78 for the WRS and IPS subscales,
respectively. The two subscales also showed high internal consistency
(Cronbach’s α > .94).

Direct child assessments of EF. In the current study, EF was 
assessed using multiple measures that were chosen to cover the 
range of EF skills discussed in early childhood literature (see 
Garon et al., 2008). We included working memory, inhibitory 
control, and attention flexibility tasks (for additional information on 
each task, see https://my.vanderbilt.edu/toolsofthemindevaluation/), 
although each task naturally also tapped other abilities such as 
motor skills or language ability. This has commonly been called 
the “task impurity” problem as EF tasks not only tap non-EF skills 
but also typically tap more than one EF skill (Miyake, Friedman, 
Emerson, Witzki, & Howerter, 2000). To account for this, we were 
consistent with the methodology of prior research in this area and 
used a battery of EF tasks to create a composite score, drawing on 
the common EF variance shared by individual tasks (e.g., Wiebe, 
Espy, & Charak, 2008). Previous research with pre-k children 
appears to support a one-factor model of EF (e.g., Fuhs & Day, 
2011; Hughes & Ensor, 2011; Wiebe et al., 2008; although see 
Miller, Giesbrecht, Müller, McInerney, & Kerns, 2012). We did,
however, analyze the EF tasks both as individual tasks as well as 
a composite score to more fully capture their contributions to 
academic skills. 

Visuo-spatial short-term memory and working memory were 
assessed using the Corsi Blocks task (Berch, Krikorian, & Huha, 
1998; Corsi, 1972). We chose a visuospatial task instead of a 
verbal task because previous research has suggested that young 
children have great difficulty with verbal working memory tasks 
such as digit span (e.g., Bull et al., 2008). We also were concerned 
about the potential confounds of using a digit span task to predict 
mathematics skills because this task also taps digit familiarity. 
Corsi Blocks required children to point to a series of block patterns 
(block placement modeled after Berch et al., 1998) that became 
progressively more difficult by increasing the number of points in 
the pattern. Children were asked to repeat a pattern exactly as 
presented (Forward) and then to reverse a presented pattern (Back- 
ward). The experimenter tapped the blocks at a rate of approxi- 
mately one block per second, and children received up to two 
attempts to successfully complete each span length until they 
scored incorrectly on both trials for a particular span length. Two
blocks of forward trials were conducted, followed by two blocks of
backward trials.

¹ Children’s ethnicity was provided to us by schools from parent reports
the schools collected. However, the reliability of this self-reported
information was unclear, and thus, this variable was not used in analyses.

Children received two practice trials prior to assessments in 
which the child only had to touch the same block as an experi- 
menter to ensure that the child could perform the basic action 
required of the task. Then, the child received additional practice 
trials that were identical to test trials but that were followed by 
feedback. Following practice, the test trials without feedback be- 
gan. Some have referred to the forward component of the task as 
a simple working memory task and the backward component as a 
complex working memory task (e.g., Garon et al., 2008), whereas 
others have referred to the forward component as a passive working
memory task and the backward component as an active
working memory task (e.g., Passolunghi & Cornoldi, 2008). 
Across interpretations, however, Corsi Blocks has been conceptu- 
alized as a working memory task and has been found to be 
significantly associated with academic skills in older children and 
adults, and particularly mathematics skills (see Raghubar, Barnes, 
& Hecht, 2010 for a review). The forward and backward versions 
of the task have shown high test-retest reliability in children ages 
4 to 11 years (rs = .83 and .82; Alloway, Gathercole, & Pickering,
2006). 
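The stopping rule described above for Corsi Blocks can be sketched in a few lines. This is only an illustration: the article does not state how the final score was computed, so returning the longest span length passed is an assumption, and the function name and input layout are hypothetical.

```python
def corsi_span(results_by_length):
    """Apply the discontinue rule to Corsi Blocks attempts.

    results_by_length maps a span length to a list of up to two
    booleans (True = pattern reproduced correctly). Scoring by the
    longest span passed is an ASSUMPTION, not the article's stated rule.
    """
    longest_passed = 0
    for length in sorted(results_by_length):
        attempts = results_by_length[length][:2]  # at most two attempts per length
        if any(attempts):
            longest_passed = length
        else:
            # Both attempts at this span length failed: testing stops here.
            break
    return longest_passed


example = {2: [True], 3: [False, True], 4: [False, False], 5: [True, True]}
print(corsi_span(example))
```

In the example, the child fails both attempts at span length 4, so the attempts recorded for length 5 are never reached.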

Attention flexibility was measured with the Dimensional 
Change Card Sort (DCCS; Zelazo, 2006). The DCCS required 
children first to sort a set of cards according to one dimension 
(color) and then according to another (shape). Children were 
presented with two boxes, one with a red truck on it, and one with 
a blue star. Each box had a slit in the top for children to sort cards 
(e.g., blue trucks and red stars). The experimenter first demon- 
strated the color game on two trials and then conducted a rule 
check with the child (“Can you show me where the blue ones go 
in the color game?” and “Can you show me where the red ones go 
in the color game?”). Children received up to two trials for each 
rule check. Children were then instructed on each trial, “If it is a 
blue one, then put it here [pointing to blue star], but if it is a red 
one, put it here [pointing to red truck].” Children were given six 
trials and if they completed at least five of six correct, they moved 
on to the shape game. In the shape game, the rules were given 
without demonstration with cards, but children still had two op- 
portunities to correctly complete the rule check. The same pass/fail 
criteria were used for both the color and shape sorts. If children 
completed both of these games successfully, children moved onto 
the advanced sort. In this game, children were asked to sort by 
color if they were presented with a card with a border around it and 
to sort by shape if the card had no border. Following both dem- 
onstration trials and rule checks, children completed 12 advanced 
sort trials, with nine out of 12 correct counted as passing. 

Again, as with the other games, the rules were repeated on each 
trial. Using scoring suggested by Zelazo (2006), children received 
a score of 0 if they were unable to successfully sort five of six trials
on the first dimension of color, a 1 if they sorted by the first 
dimension but did not meet the five-out-of-six cutoff criterion for 
sorting by the dimension of shape, a 2 if they were successful 
sorting by shape, and a 3 if they also passed the border sort (i.e., 
correct on at least nine of the 12 trials). Performance on DCCS has 
shown moderate test-retest reliability with children 36 to 72 
months (r = .44; Müller, Kerns, & Konkin, 2012). Larger scores
on the DCCS were interpreted as indicating greater attention 
flexibility. 
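The 0 to 3 DCCS scoring rule just described can be written as a small function. This is a minimal sketch of the rule as stated in the text (five of six correct to pass the color and shape sorts, nine of 12 to pass the border sort); the function and parameter names are hypothetical.

```python
def dccs_score(color_correct, shape_correct=0, border_correct=0):
    """Return the 0-3 DCCS score from the number of correct trials per phase."""
    if color_correct < 5:    # did not sort 5 of 6 by color
        return 0
    if shape_correct < 5:    # passed color, did not sort 5 of 6 by shape
        return 1
    if border_correct < 9:   # passed shape, did not reach 9 of 12 on the border sort
        return 2
    return 3                 # passed all three phases


print(dccs_score(6, 5, 9))
```

Because later phases were administered only after passing the earlier ones, the later counts default to zero.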


Copy Design (Osborne, Butler, & Morris, 1984) required chil- 
dren to copy eight simple geometric shapes of increasing diffi- 
culty. Tasks of this type are drawing increasing attention (Cameron 
et al., 2012; Potter, Mashburn, & Grissmer, 2013). Cameron et al. 
(2012) described the task as requiring children “to process visual 
information from an external stimulus, invoke a mental represen- 
tation, and coordinate motor movements to reproduce the image” 
(p. 1240). Children had two attempts to successfully draw each 
shape and each attempt was coded to indicate whether the child 
successfully replicated a design. 

This task was used in the British Longitudinal Study analyzed 
by Duncan et al. (2007) and was one of the stronger long-term 
predictors of child outcomes. It was also recently used in a large- 
scale measurement study of children’s cognitive self-regulation 
development (Lipsey et al., 2014). In this longitudinal measure- 
ment study, Copy Design was significantly correlated with a bat- 
tery of cognitive self-regulation assessments and showed both 
construct validity via confirmatory factor analysis (Fuhs & Turner, 
2012) and predictive validity for academic achievement (Lipsey et 
al., 2014). Interrater reliability for Copy Design in this study was 
established by two independent raters double coding 20% of the 
measures. The kappa coefficients for the 8 shapes ranged from .66
to 1.00 (M_kappa = .79). Each shape was scored 0 if coded as
incorrect and 1 if coded as correct. Larger scores on the Copy 
Design were interpreted as indicating greater sustained attention. 
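For readers unfamiliar with the agreement statistic, the following sketch shows how a kappa coefficient for one shape could be computed from two raters' codes. It assumes Cohen's kappa for two raters (the article says only "kappa"), and the ratings below are made-up illustration data, not values from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items (assumed statistic)."""
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal proportions.
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if expected == 1:  # both raters used a single identical category
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten drawings of one shape (1 = correct, 0 = incorrect).
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(a, b), 2))
```

Here the raters agree on nine of ten drawings, chance agreement is .50, and kappa is therefore (.90 − .50) / (1 − .50) = .80.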

Inhibitory control was assessed with Peg Tapping (PT; Dia- 
mond & Taylor, 1996) and Head Toes Knees Shoulders (HTKS; 
Ponitz, McClelland, Matthews, & Morrison, 2009). PT requires 
children to tap a peg once when an examiner taps it twice and to 
tap twice when an examiner taps once. Children completed 16 test 
trials that were scored 0 for incorrect responses and 1 for correct. 
If the child could not successfully complete practice trials on PT, 
he or she scored —1 and did not complete the test trials. Perfor- 
mance on PT has shown high test-retest reliability in 5-year-olds 
(r = .74; Nampijja et al., 2010). 

HTKS is another task primarily assessing inhibitory control, 
although the task likely also taps children’s working memory as 
directions are not repeated on each trial, attention shifting as the 
rules change during the game, and gross motor coordination. 
Children were asked to touch their heads when an examiner says 
“touch your toes” and to touch their toes when an examiner says 
“touch your head.” If children were successful at inhibiting the
prepotent response of behaving consistently with the prompt, two
new prompts were added: Children were then required to touch
their knees when an examiner says “touch your shoulders” and
vice versa. Children received up to a total of six practice trials and
20 test trials, and each trial was scored 0 for an incorrect
response, 1 for motion toward the incorrect response but ending
with a correct response, and 2 for a correct response. Performance 
on HTKS has shown high test-retest reliability in 4-year-olds (r = 
.80; Meador, Turner, Lipsey, & Farran, 2013). Larger scores on 
both PT and HTKS indicated greater ability to inhibit a prepotent 
response. 
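The trial-level scoring for the two inhibitory control tasks can be sketched as follows. The per-trial points and the score of −1 for children who could not pass Peg Tapping practice come from the text; summing trial points into a total is an assumption, and all names are hypothetical.

```python
HTKS_POINTS = {"incorrect": 0, "self_corrected": 1, "correct": 2}


def peg_tapping_score(passed_practice, test_trials):
    """Total Peg Tapping score; test_trials holds booleans for up to 16 trials."""
    if not passed_practice:
        return -1  # per the text: test trials were not administered
    if len(test_trials) > 16:
        raise ValueError("Peg Tapping has 16 test trials")
    return sum(1 for correct in test_trials if correct)


def htks_score(test_trials):
    """Total HTKS score; test_trials holds one label per trial (up to 20)."""
    if len(test_trials) > 20:
        raise ValueError("HTKS has 20 test trials")
    # ASSUMPTION: the task score is the sum of the 0/1/2 trial points.
    return sum(HTKS_POINTS[label] for label in test_trials)


print(peg_tapping_score(True, [True] * 10 + [False] * 6))
print(htks_score(["correct", "self_corrected", "incorrect"]))
```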

Academic achievement. Academic achievement was as- 
sessed by administering seven subscales of the Woodcock Johnson 
III achievement battery (WJ-III; Woodcock, McGrew, & Mather,
2001). Literacy skills were assessed with the Letter-Word Identi- 
fication and Spelling subtests. Letter-Word Identification measures 
children’s ability to identify and pronounce alphabet letters and 
read words, while Spelling measures children’s ability to draw
simple shapes and write orally presented letters and words. Language 
skills were assessed with the Academic Knowledge, Oral Compre- 
hension, and Picture Vocabulary subtests. Academic Knowledge tests 
children’s factual knowledge of science, social studies, and the 
humanities; for young children the subtest mainly consists of 
labeling and identifying pictures, thus heavily relying on vocabu- 
lary. Oral Comprehension asks children to complete an orally 
presented passage by providing the appropriate missing word on 
the basis of semantic and syntactic cues. Picture Vocabulary asks 
children to name objects presented in pictures; it is a test of nouns 
or knowing the names of things. Mathematics skills were assessed 
with the Applied Problems and Quantitative Concepts subtests. 
Applied Problems measures children’s ability to solve numerical 
and spatial problems accompanied by pictures while Quantitative 
Concepts measures children’s understanding of number identification,
sequencing, shapes, and symbols and, in a separate section, their
ability to manipulate the number line. All analyses were conducted using the
item response theory-scaled W-Scores. Standard scores normed 
with a mean of 100 (SD = 15) can be more interpretable for 
descriptive purposes and are therefore presented in addition to the 
W score means in Table 1. 


Procedure


Teacher reports were completed in the fall after children had
acclimated to the classrooms, about 4–6 weeks. The ratings were
collected close to the same time period in which children completed
direct assessments. All direct assessments of children were
conducted in a quiet area of the building in which they had their
prekindergarten program. Data from children’s EF assessments in
the fall and their academic achievement in both fall and spring
were used in analyses. Assessments were administered in a fixed
order at each time point. In one child testing session, children
completed PT, HTKS, and Copy Design, followed by the WJ-III
Oral Comprehension, Applied Problems, Quantitative Concepts,
and Picture Vocabulary subtests. The other testing session consisted
of the DCCS and Corsi Blocks, followed by the WJ-III
Letter-Word Identification, Academic Knowledge, and Spelling
subtests. The average interval between fall and spring sessions was
7.38 months (SD = 0.55 months).


Analytic Approach


A series of multilevel models (children nested within classrooms,
schools, and systems) was conducted to examine the associations
between children’s EF and academic achievement gains,
using different methodologies to assess EF (teacher report and
behavioral assessment). All predictors of interest and outcomes
were included as standardized variables so that the parameter
estimates could be compared across models. Prior to running
conditional models, we first ran a fully unconditional model to
determine the percentage of variance in academic achievement
outcomes accounted for by the classroom, school, and system
levels of our model. We then proceeded to run conditional models
to test the associations between EF direct assessments and teacher

Table 1
Descriptive Statistics

                                                                  W score
Variable                   N            M            SD           M        SD           Skew         t
Corsi Forward T1           [illegible]  2.52         1.20                               -0.61
Corsi Backward T1          716          [illegible]  [illegible]                        0.24
DCCS T1                    717          1.38         0.62                               -0.14
Copy Design T1             717          0.91         1.44                               2.30
HTKS T1                    [illegible]  11.56        [illegible]                        [illegible]
Peg Tapping T1             [illegible]  [illegible]  5.86                               0.37
Work-Related Skills T1     716          4.60         1.13                               -0.29
Interpersonal Skills T1    716          5.18         1.06                               -0.81
Letter Word T1             716          93.23        12.21        318.66   24.57        0.09
Letter Word T2             700          100.13       11.12        347.91   22.38        -0.23        [illegible]
Spelling T1                716          79.15        [illegible]  336.40   [illegible]  -0.14
Spelling T2                700          86.20        15.04        369.31   26.94        -0.26        36.24
Academic Knowledge T1      716          91.31        12.61        436.21   15.82        -0.62
Academic Knowledge T2      700          97.15        [illegible]  449.04   13.34        -0.66        29.20**
Oral Comprehension T1      717          94.55        [illegible]  445.39   13.42        -0.18
Oral Comprehension T2      703          99.03        11.65        456.33   13.13        [illegible]  [illegible]
Picture Vocabulary T1      [illegible]  100.67       11.40        462.09   12.78        [illegible]
Picture Vocabulary T2      703          101.03       9.63         468.77   9.93         -2.47        16.93**
Applied Problems T1        717          96.69        [illegible]  390.27   25.10        [illegible]
Applied Problems T2        703          100.45       [illegible]  411.54   18.77        -0.88        29.90**
Quantitative Concepts T1   [illegible]  87.93        10.69        406.13   12.43        0.68
Quantitative Concepts T2   703          92.97        12.87        422.50   14.60        -0.03        40.46**

Note. T1/T2 = Time 1/Time 2; DCCS = Dimensional Change Card Sort (Zelazo, 2006); HTKS = Head Toes Knees Shoulders (Ponitz, McClelland,
Matthews, & Morrison, 2009). t statistics reported are from independent-samples t tests comparing assessment performance at T2 to T1. For the Woodcock
Johnson III achievement battery (WJ-III; Woodcock, McGrew, & Mather, 2001), standard scores are reported, but W scores are also reported as they were
used for tests of skewness and longitudinal analyses including t tests. Cooper-Farran Behavioral Rating Scale (CFBRS; Cooper & Farran, 1988, 1991)
Work-Related Skills and Interpersonal Skills scores were based on average ratings on a 7-point scale.
reports and gains in academic skills in literacy, language, and
mathematics. Our predictors of interest were entered as fixed 
effects. A number of covariates were also entered as fixed effects 
at the child level including pretest scores, pre-post testing interval, 
age at pretest, gender, and IEP status. Because these data were 
taken from a large-scale RCT, condition was included as a fixed 
effect at the school level, although our primary interest was not in 
evaluating condition but rather in accounting for potential variance 
in outcomes due to variations in curricula used in classrooms. All 
multilevel models were run in IBM SPSS (Version 20 Mixed 
Models) using restricted maximum-likelihood estimation. A sam- 
ple model equation for the EF direct assessments and teacher 
reports entered simultaneously to predict language outcomes is 
presented in the Appendix.


Results 


Descriptive Statistics 


Descriptive statistics for all variables are presented in Table 1. 
EF direct assessment scores and teacher ratings from the fall are 
presented in the top of the table; it is clear that for all the direct 
assessments variation among the children was great. Children 
entered pre-k with quite different EF skills. As indicated by the t 
tests in Table 1, children significantly improved their academic 
skills, with children making particularly large gains on W scores in 
WJ-III Letter-Word Identification, Spelling, and Quantitative Concepts
subtests. We also examined each variable for evidence of
nonnormality; both Copy Design and Picture Vocabulary showed 
evidence of skewness (see Table 1). Copy Design was positively 
skewed, and Picture Vocabulary was negatively skewed at both 
time points. Therefore, we performed log transformations of these 
variables before entering them in analyses. The log transformation 
reduced the skewness of Copy Design to .958, the skewness of 
Picture Vocabulary at Time 1 to —1.464, and the skewness of 
Picture Vocabulary at Time 2 to —1.057. Transformed variables
were used in all subsequent analyses. 
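As an illustration of how a log transformation pulls in a long right tail such as the one observed for Copy Design, the sketch below computes skewness before and after transforming a small made-up sample. Several details are assumptions rather than the authors' procedure: the moment-based skewness formula (statistical packages typically report a bias-adjusted version), the use of log(x + 1) to accommodate zero scores, and the data themselves.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5 (an assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed scores with many zeros and one high value.
raw = [0, 0, 0, 1, 1, 2, 8]
# ASSUMPTION: add 1 before taking logs so that zero scores are defined.
logged = [math.log(x + 1) for x in raw]

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The transformed scores remain positively skewed in this toy sample, but noticeably less so than the raw scores.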


Data Reduction for EF and Achievement 


Due to the nature and complexity of direct assessments of EF in 
young children, we are presenting results of our analytic models in 
two ways: (a) for the individual EF tasks entered simultaneously 
and (b) for the component EF score. We used principal component 
analysis (PCA) to extract common variance among the EF tasks 
(Corsi Blocks, DCCS, Copy Design, HTKS, and Peg Tapping) and 
saved a component score for one set of analyses. Using eigenval- 
ues of >1 as the criterion to determine the number of components, 
a one-component PCA solution for EF at Time 1 accounted for
41.58% of the variance in the assessments. Component loadings 
for the EF measures were all above .50. We saved the component 
scores as a variable, which was standardized with a mean of 0 and 
a standard deviation of 1 and used in one set of analyses.

PCAs were also conducted to reduce redundancy in the mea- 
surement of academic achievement and to ensure that high corre- 
lations among the subtests would not jeopardize model specificity. 
We saved component scores for literacy (WJ-III Letter-Word 
Identification and Spelling), language (WJ-III Academic Knowl- 
edge, Oral Comprehension, and Picture Vocabulary), and mathe- 


matics (WJ-III Applied Problems and Quantitative Concepts) at 
each time point. For literacy, we entered the WJ-III Letter-Word 
Identification and Spelling subtests into a PCA, and we
found a one-factor solution that accounted for 71.21% of the 
variance in the measures at Time 1. Again, for this and all other 
component scores, we saved these scores as variables and used 
them in analyses. At Time 2, the one-factor PCA solution for 
literacy with the same two measures accounted for 77.81% of the 
variance in the assessments. For language, the component score 
explained 72.24% of the variance at Time 1 and 72.53% of the 
variance at Time 2. The mathematics PCA produced a component 
that explained 82.16% of the variance in the assessments at Time 
1 and 83.54% of the variance at Time 2. Within each PCA, the 
loadings of each of the measures onto the component were all 
above .80. 
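For the two composites built from exactly two subtests (literacy and mathematics), the reported percentages can be related to the underlying correlation. If the PCA was run on the correlation matrix of the two subtests (an assumption; the article does not say), the eigenvalues are 1 + r and 1 − r, so the first component explains (1 + r) / 2 of the variance and each subtest loads on it at the square root of that share. The sketch below applies this identity to the reported percentages; the function names are hypothetical.

```python
import math


def first_component_share(r):
    """Share of variance explained by the first component of a 2-variable PCA."""
    return (1 + abs(r)) / 2


def implied_correlation(share):
    """Correlation between the two subtests implied by a reported share."""
    return 2 * share - 1


def loading(share):
    """Loading of each subtest on the first component (equal for both)."""
    return math.sqrt(share)


for label, share in [("literacy T1", 0.7121), ("mathematics T1", 0.8216)]:
    print(label, round(implied_correlation(share), 3), round(loading(share), 3))
```

Under that assumption, the literacy share of 71.21% corresponds to a subtest correlation of about .42 and loadings of about .84, and the mathematics share of 82.16% to a correlation of about .64 and loadings of about .91, both consistent with the statement that all loadings exceeded .80. The identity does not apply to the three-subtest language composite.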


Correlations 


Zero-order correlations among demographics, teacher ratings, 
the EF component score, individual EF direct assessments, and the 
achievement composites for both fall and spring are presented in 
Table 2. All academic achievement component scores, teacher 
ratings, and EF scores were moderately to strongly correlated with 
each other. Particularly strong correlations emerged between the 
EF direct assessments and the mathematics composite score at 
both time points. The EF direct assessment composite and teacher 
reports of EF (WRS) were strongly correlated, suggesting they are 
tapping a similar underlying construct. Interestingly, the correlations
between the EF composite score and entering achievement
were somewhat higher than between teacher WRS ratings and 
entering academic skills, suggesting teachers were rating observed 
classroom behavior and not just children’s skill levels. The corre- 
lations were notably weaker between teacher reports of social 
skills (IPS) and both direct assessments of EF and academic 
achievement. 


Unconditional Multilevel Models 


We first ran unconditional models to determine the percentage 
of variance in academic achievement gains in literacy, language, 
and mathematics that could be accounted for by the nesting levels 
of classroom, school, and system. It was especially important to 
establish the percentage of variance in academic achievement 
gains that could be accounted for by child-level differences as our 
predictors of interest were at the child level. If we found, for 
example, that the largest percentage of variance in achievement 
outcomes was at the classroom level, we would have little variance 
left to be explained by child-level predictors. Because our focus 
was on explaining children’s pre-k gains in academic achievement, 
unconditional models included the covariates of gender, age, ex- 
perimental condition, IEP status, interval of time that elapsed 
between pre- and posttest, and pretest achievement scores. 

Based on the random parameters reported in the first columns of 
Tables 3, 4, and 5 (labeled “Unconditional Model”), 89.3% of
variance in literacy outcomes, 94.1% of variance in language 
outcomes, and 92.7% of variance in mathematics outcomes (note 
that values were rounded) was attributed to child-level differences 
and could be modeled with our child-level predictors of interest, 
namely, EF direct assessments and teacher reports. Although the 
Table 3
EF Direct Assessments and Teacher Reports Predict End of Prekindergarten Literacy Skills
Unconditional
model Model 1 Model 2 Model 3 Model 4 Model 5
Variable B SE B SE B SE B SE B SE B SE


Fixed Parameters 


Intercept 0.01 0.08 0.00 0.09 0.00 0.09 —0.01 0.08 —0.01 0.09 —0.02 0.09 
Gender Ose 0:035 =042+ 0.03 One 0.03  —0.09** 0.03 —0.09** 0.03 —0.09** 0.03 
Age —0.04 01035 - —0:08" 0.03 =0:07% 0.03 —0.06" 0.03 —0.08** 0.03 =(O07" 0.03 
Curriculum Condition —0.04 0.07  —0.05 0.07 —0.04 O0O78 —=0:01 0.07 —0.02 0.07 —0.01 0.07 
IEP Status —0.04 0.03  —0.02 0.03 —0.02 0.03 —0.01 0.03 0.00 0.03 (OOK 0.03 
Pre-Post Interval 0102 0.03  —0.03 0.04 —0.03 0.04 —0.04 0.04 —0.04 0.04 —0.04 0.04 
Pretest 0.66** 0.03 O5TS 0.03 0.59** 0.03 0.60** 0.03 0.54** 0.03 O77 4 0.03 
Corsi Forward 0.09** 0.03 0.07" 0.03 

Corsi Backward 0.01 0.03 —0.01 0.03 

DCCS 0.04 0.03 0.03 0.03 

HTKS —0.02 0.03 —0.03 0.03 

Peg Tapping 0.07" 0.04 0.05 0.04 

Copy Design 0.09** 0.03 0.09%* 0.03 

EF Composite Score 0.16** 0.04 0.10°* 0.04 
EF Teacher Report 0.18** 0.03 OSS 0:03 01552 gO103 

Random Parameters 

Child 0.49** 0.03 0.47** 0.03 0.48** 0.03 Oe 0.03 0.46** 0.03 0.46°* 0.03 
Classroom 0.03 0.02 0.02 0.03 0.03 0.04 0.03 0.04 0.02 0.04 0.03 0.04 
School 0.00 0.00 0.01 0.04 0.00 0.04 0.01 0.05 0.01 0.04 0.01 0.05 
System 0.02 0.02 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.03 0.03 0.03 
Pseudo-R² 0.04 0.03 0.05 0.06 0.06


Note. EF = executive functioning; IEP = individualized education plan; DCCS = Dimensional Change Card Sort (Zelazo, 2006); HTKS = Head Toes
Knees Shoulders (Ponitz, McClelland, Matthews, & Morrison, 2009). Pseudo-R² estimates indicate the amount of within-child variability in end of
prekindergarten literacy skills explained by the addition of EF measures to the unconditional model.
percentage of variance accounted for by the different levels of the
model varied across academic content areas, we decided to account
for all levels in all academic content area analytic models to 
maintain consistency across models and to aid in comparison of 
effects. 
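The child-level percentages reported above follow from dividing the child-level variance component by the sum of the components at all four levels. The sketch below reproduces that arithmetic using the rounded random parameters printed for the unconditional models, read here as child = 0.32 and classroom = 0.02 for language (Table 4) and child = 0.49, classroom = 0.03, school = 0.00, system = 0.02 for literacy (Table 3); these readings of the printed tables are an assumption, and the function name is hypothetical.

```python
def variance_shares(components):
    """Proportion of total random variance located at each nesting level."""
    total = sum(components.values())
    return {level: value / total for level, value in components.items()}


language = {"child": 0.32, "classroom": 0.02, "school": 0.00, "system": 0.00}
literacy = {"child": 0.49, "classroom": 0.03, "school": 0.00, "system": 0.02}

print(round(100 * variance_shares(language)["child"], 1))
print(round(100 * variance_shares(literacy)["child"], 1))
```

The language components reproduce the reported 94.1%. The rounded literacy components give roughly 90.7% rather than the reported 89.3%, a gap of the size one would expect when the printed components are rounded to two decimals, as the text notes.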


Conditional Multilevel Models 


In Tables 3—5, we present the associations between children’s 
fall EF scores and their spring academic achievement in the areas 
of literacy, language and mathematics after controlling for demo- 
graphic covariates and children’s academic achievement in the fall, 
in other words, the gain in achievement related to initial EF scores.
Each table focuses on a different academic area and each contains 
five models. The first model in each table is the model with the EF 
direct assessments entered individually into the model to predict 
gains in academic achievement. The second model (Model 2)
shows results when the EF composite score was entered alone. 
Model 3 in each table depicts the results when the WRS ratings are 
entered; Model 4 shows the individual direct assessment scores 
with the addition of the teacher ratings. Finally in each table, 
Model 5 examines the joint contribution of the EF direct assess- 
ment composite and the teacher ratings of EF in predicting gains in 
achievement across the year. 

Multilevel models for EF direct assessments. First we dis- 
cuss the results from Models 1 and 2 examining the effects for the 
individual direct assessments entered simultaneously versus the 
composite score alone. For literacy gains (Table 3), Corsi For- 
ward, Peg Tapping, and Copy Design were significant or marginal 


predictors. For language gains shown in Table 4, none of the 
individual EF measures significantly predicted growth except for a 
marginal effect for Corsi Backward. The individual significant EF 
predictors of mathematics gains as shown in Table 5 were Corsi 
Forward, Peg Tapping, Copy Design and DCCS. HTKS did not 
predict any achievement content area gains. When examining 
variance accounted for by adding the group of EF assessments to 
the conditional model, the pseudo-R² estimate of effect size was
largest for the mathematics content area. 

Model 2 in Tables 3-5 presents the results when EF skills are 
entered as a PCA composite score. The composite EF score was 
predictive of children’s literacy, language, and mathematics skills 
in the spring with fall pretest scores entered as a covariate. Again, 
the pseudo-R² estimate of effect size was the largest for mathematics.
Thus, Models 1 and 2 indicate that direct assessments of
EF both individually and as a composite related to achievement 
gains over and above children’s entering skill levels, but the 
magnitude of effects varied both by individual EF assessment and 
by academic content area. 

Multilevel models for EF teacher reports. Model 3 tested 
the association between teachers’ ratings of children’s EF and 
spring academic achievement in the areas of literacy, language, 
and mathematics after controlling for demographic covariates and 
children’s academic achievement in the fall. After accounting for 
covariates and academic pretest skills, Model 3 in each of Tables 
3-5 demonstrates that teachers’ fall reports of EF significantly 
predicted literacy, language, and mathematics outcomes, with the 
pseudo-R² estimate of effect size for literacy achievement being
Table 4
EF Direct Assessments and Teacher Reports Predict End of Prekindergarten Language Skills
Unconditional
model Model 1 Model 2 Model 3 Model 4 Model 5
Variable B SE B SE B SE B SE B SE B SE
Fixed Parameters 
Intercept 0.07 0.04 0.07 0.04 0.077 0.04 0.06 0.04 0.06 0.04 0.06 0.04 
Gender 0.02 0.02 0.03 0.02 0.03 0.02 0.05* 0.02 0.05" 0.02 0.05* 0.02 
Age —0.03 0.02 —0.04 0.02 —0.05 0.03 —0.04" 0.02 —0.04" 0.02 —0.05* 0.02 
Curriculum Condition =o OMIT 0.05 —0.11* 0.05 = (el 0.06 —0.09* 0.05 —0.10" 0.05 —0.107 0.05 
IEP Status —0.03 0.02 —0.02 0.02 —0.02 0.02 —0.01 0.02 —0.01 0.02 —0.01 0.02 
Pre-Post Interval 0.07** 0.03 0.06* 0.03 0.06* 0.03 0.06* 0.02 0.06" 0.03 0.06* 0.03 
Pretest 0.80" 0.02 O52 0.03 0.75™* 0.03 O75 0.03 O78 0.03 (0-738 0.03 
Corsi Forward 0.03 0.03 0.01 0.03 
Corsi Backward 0.04" 0.02 0.03 0.02 
DCCS 0.03 0.02 0.03 0.02 
HTKS 0.01 0.03 0.01 0.03 
Peg Tapping 0.03 0.03 0.02 0.03 
Copy Design —0.01 0.02 —0.01 0.02 
EF Composite Score 0109's 0103) .05* 0.03 
EF Teacher Report 0.10 0.04 0.10™ 0.03 NOs 0103 
Random Parameters 
Child 032 0.02 O39 0.02 032 0.02 OS itee 0.02 Ose 0.02 0B 2m 0.02 
Classroom 0.027 0.01 0.02* 0.01 0.02" 0.01 0.01 0.01 0.017 0.01 0.017 0.01 
School 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
System 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
Pseudo-R² 0.01 0.01 0.02 0.02 0.02


Note. EF = executive functioning; IEP = individualized education plan; DCCS = Dimensional Change Card Sort (Zelazo, 2006); HTKS = Head Toes
Knees Shoulders (Ponitz, McClelland, Matthews, & Morrison, 2009). Pseudo-R² estimates indicate the amount of within-child variability in end of
prekindergarten language skills explained by the addition of EF measures to the unconditional model.
the largest. Therefore, although EF direct assessments accounted
for the most variance in mathematics achievement gains, EF
teacher reports accounted for the most variance in literacy
achievement gains.

Multilevel models for EF direct assessments and teacher
reports together. Finally, we assessed the unique contributions
of direct assessments of EF, individually and as a composite, and
teacher reports of EF when entered into a model simultaneously
(see Tables 3-5, Models 4 and 5). We again ran two separate
models, one in which EF direct assessments were entered
individually with the teacher ratings and one in which the EF direct
assessments composite score was entered with teacher ratings but
without the individual EF assessments. After controlling for
covariates and academic pretests, teacher reports remained significant
predictors of academic achievement when entered simultaneously
with EF direct child assessments.

As shown in Table 3, Model 4, for literacy gains both teacher
ratings and EF assessments of Corsi Forward and Copy Design
continued to be significant predictors of literacy outcomes.
Similarly, both teacher ratings and the EF component score were
uniquely related to literacy gains (Model 5). For language
outcomes presented in Table 4, none of the individual EF measures
was significantly associated with outcomes, but Model 5 shows
that the EF component score was a marginal predictor when
included in the model with teacher reports. Table 5 shows that
Corsi Forward, DCCS, Peg Tapping, Copy Design (Model 4), and
the EF component score (Model 5) were significant or marginal
predictors of mathematics outcomes along with teacher ratings.

To summarize, Models 4 and 5 indicate that both EF methods
(teacher reports and direct assessments, whether entered
individually or as a composite) were significant predictors of children’s
achievement outcomes in language, literacy, and mathematics. The
standardized coefficients for teacher reports were larger for
language and literacy, whereas the coefficients for direct assessments
were larger for mathematics. The magnitude of the EF effect when
both assessment types were entered simultaneously was largest for
mathematics outcomes.

Multilevel models for interpersonal skills. Despite the fact
that our analyses related to gains in achievement by including
pretest in the models, one could question whether teacher ratings
reflected a general favorable bias toward higher achieving
children. If so, that bias should have been reflected in all the ratings
teachers provided, both the ones related to EF and the ones related
to social interactions. Therefore, we also assessed whether
teachers’ fall ratings of their children from the other CFBRS scale, the
IPS subscale, also accounted for unique variation in academic
achievement gains across the pre-k year above and beyond
variance accounted for by children’s EF direct assessment as well as
covariates including the academic skills pretests. Teacher ratings
of children’s interpersonal skills were not predictive of their
academic achievement gains above and beyond the EF direct
assessments when entered individually for literacy (β = .04, SE = .03,
p = .171), language (β = .03, SE = .02, p = .226), or mathematics
(β = .003, SE = .02, p = .901). Teacher ratings of children’s
interpersonal skills were also nonsignificant predictors of
academic achievement gains when entered into models with the EF
Table 5 
EF Direct Assessments and Teacher Reports Predict End of Prekindergarten Mathematics Skills 


Unconditional 
model Model 1 Model 2 Model 3 Model 4 Model 5 




Variable B SE B SE B SE B SE B SE B SE 


Fixed Parameters 


Intercept 0.05 0.06 0.05 0.05 0.05 0.05 0.04 0.06 0.04 0.05 0.04 0.05 
Gender 0.00 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.04 0.02 0.04 0.02 
Age —0.02 0.03 —0.06* 0.03 = 01055 0/03) 0103 0.03. —0.06* 0.03 —0.06* 0.03 
Curriculum Condition —0.06 0.06 —0.08 0.06 —0.07 0.06 —0.04 0.06 —0.06 0.06 —0.06 0.06 
IEP Status —0.02 0.03 0.00 0.02 —0.01 0.02 0.00 0.02 —0.01 0.02 0.01 0.02 
Pre-Post Interval 0.05 0.03 0.03 0.03 0.04 0.03 0.04 0.03 0.03 0.03 0.03 0.03 
Pretest 0.74 0.03 OlrolEs 0.03 0:60 70:03 0.67" — 0.06 0.60** 0.03 O57 0:03 
Corsi Forward 0.12™* 0.03 OMe 0.03 

Corsi Backward 0.00 0.03 —0.01 0.02 

DCCS 0.07* 0.03 0.06* 0.03 

HTKS 0.02 0.03 0.02 0.03 

Peg Tapping 0.06" 0.03 0.05" 0.03 

Copy Design 0.07" 0.03 0.06* 0.03 

EF Composite Score O23 cares 008, 0:207=5) 0.03 
EF Teacher Report ON ASeanO03 0.10** 0.03 O10 e003 

Random Parameters 

Child 039% 0:02 0.36™ 0.02 0.36 0.02 OST = 0102 0.36" 0.02 0.36 0.02 
Classroom 0.03* 0.01 0.02 0.02 0.03* 0.01 0.03* 0.01 0.02 0.02 0.03* 0.01 
School 0.00 0.02 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 
System 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.01 
Pseudo-R² 0.06 0.05 0.03 0.07 0.07 





Note. EF = executive functioning; IEP = individualized education plan; DCCS = Dimensional Change Card Sort (Zelazo, 2006); HTKS = Head Toes 
Knees Shoulders (Ponitz, McClelland, Matthews, & Morrison, 2009). Pseudo-R² estimates indicate the amount of within-child variability in end of 
prekindergarten mathematics skills explained by the addition of EF measures to the unconditional model. 
composite score: literacy (β = .04, SE = .03, p = .193), language
(β = .03, SE = .02, p = .221), mathematics (β = .01, SE = .03,
p = .829).

Discussion

In the present study, we compared the unique contributions of
teacher reports of EF and direct child assessments of EF when
added into a model simultaneously to predict gains in academic
skills across the pre-k year. Three important findings emerged.
First, children’s EF skills both as rated by teachers and as observed
in direct child assessments were significantly related to each other.
Second, when entered in separate models, direct assessments and
teacher reports of children’s EF skills at school-entry were
significantly related to their academic gains in literacy, language, and
mathematics in pre-k above and beyond covariates. Third, both
teacher reports of children’s EF and direct assessments of EF
remained significant predictors of literacy and mathematics gains
even when both were entered into a model simultaneously. Direct
assessments of EF were only marginally associated with language
gains when entered into the model simultaneously with teacher
reports.

Teacher-rated EF was strongly related to behavioral assessments
of children’s EF, which suggests that they may be tapping
similar if not identical underlying characteristics; our correlations
were stronger than those summarized by Toplak et al. (2013) in
clinical populations. In fact, as can be seen in the simple
correlations, in some cases the teacher reports of EF were more highly
correlated with individual EF direct assessments than some of the
correlations among EF direct child assessments themselves. This
provides further evidence that these two different methodologies
may be tapping into similar skill sets in young children and also
speaks to the measurement issues with individual direct
assessments. Although previous research has estimated correlations
between the HTKS and teacher reports of behavioral regulation in the
early childhood classroom (McClelland et al., 2007), the current
study extends this research by creating a component score of EF
that draws upon the common variance among these measures, and
associating it with teacher reports of specific EF behaviors in the
classroom. The correlations obtained in the current study were
much higher than those obtained by McClelland et al. (2007) using
only the HTKS, suggesting the possibility that when the variance
unique to each individual assessment is removed, the overlap
between teacher reports and child direct assessments of EF may be
higher than previously reported. However, the correlations
between teacher reports of EF and direct child assessments were not
so high as to suggest complete redundancy in measurement. It is
quite possible that the teacher report may tap skills that are not
being tapped by direct child assessments, lending support to the
idea that both methodologies may yield important information
about young children’s EF skills. Direct assessments may capture
children’s available cognitive processes and teacher reports may
assess how these processes are used in a real-world setting.

We found positive associations between children’s entering EF
skills, assessed through direct child assessments and teacher
reports, and their gains in literacy, language, and mathematics skills.
When entered into models separately, the effect size estimates of
the unique variance accounted for by EF (above and beyond 
covariates and pretests) were larger for the model of teacher 
reports predicting literacy and language compared to the models 
including direct assessments. This was not the case for mathematics 
achievement gains, as the addition of direct assessments of EF to 
the model accounted for a larger proportion of variance compared 
to the addition of teacher reports. When examining teacher reports 
and direct assessments entered together as predictors, effects dif- 
fered such that after controlling for covariates and pretests, the 
strongest effect of children’s school-entry EF skills was observed 
for gains in mathematics achievement. Effect sizes were smaller 
across the board for the variance accounted for by EF above and 
beyond language pretests and covariates. 

While the pseudo-R² estimate of effect size indicated that EF 
skills accounted for 7% of the variance in gains in children’s 
mathematics achievement, it is necessary to interpret the magni- 
tude of this effect based on its practical value for a given field 
(Cumming, 2014). In the present case, the practical significance of 
the effects must be interpreted in light of explaining unique vari- 
ance in gains in children’s academic achievement beyond that 
which can be explained by initial achievement as well as covari- 
ates. Such a conservative approach suggests that even for smaller 
pseudo-R² estimates, the effects remain meaningful for practice. 
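
As a rough illustration of how a pseudo-R² of this kind is often computed, the sketch below uses the proportional reduction in child-level residual variance between a comparison model and the same model with EF measures added. The article does not state its formula, so this definition is an assumption, and the variance values are hypothetical rather than taken from Tables 3-5.

```python
def pseudo_r2(var_baseline: float, var_model: float) -> float:
    """Proportional reduction in residual variance relative to a baseline model.

    Assumed definition (not given in the article):
        (baseline variance - model variance) / baseline variance
    """
    if var_baseline <= 0:
        raise ValueError("baseline variance must be positive")
    return (var_baseline - var_model) / var_baseline


# Hypothetical child-level residual variances, for illustration only.
baseline_variance = 0.400  # comparison model without EF measures
ef_model_variance = 0.372  # same model after adding EF measures

print(round(pseudo_r2(baseline_variance, ef_model_variance), 2))  # prints 0.07
```

Under this definition, a value of 0.07 would mean that adding the EF measures removed 7% of the child-level variance left unexplained by the comparison model.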

Several researchers have previously posited that mathematics 
skills uniquely tax children’s inhibitory control, attentional flexi- 
bility, and working memory (Blair, Knipe, & Gamson, 2008). 
Specifically, recent accounts suggest that although children may 
initially heavily recruit EF resources for academic learning across 
domains, certain skills, such as literacy, may become more auto- 
matic requiring less higher order problem solving compared to 
mathematics tasks, which likely only increase in their cognitive 
demands as new mathematical concepts are learned (e.g., Blair et 
al., 2008; Welsh et al., 2010). Our results are consistent with this 
account when examining the variance in mathematics achievement 
gains accounted for by both EF direct child assessments and 
teacher report simultaneously, such that the effect size values were 
the largest for mathematics models. However, we found somewhat 
smaller but still important predictions from EF measures to gains 
in literacy, suggesting that during the pre-k year at least, these 
skills are not as automatic as they will become in kindergarten and 
first grade. 

When examining direct child assessments and teacher reports 
separately, we found that teacher reports of EF accounted for more 
variance in literacy and language achievement gains than did EF 
direct assessments examined alone. Conversely, the model with EF 
direct assessments alone accounted for more variance in mathe- 
matics achievement gains compared to the model with EF teacher 
reports alone. Prior work on EF and achievement is largely based 
on the use of direct child assessments; the current study results 
suggest that the addition of teacher reports will yield a more 
comprehensive picture, at least in pre-k. It could be that previous 
findings were limited by the use of measures of only one meth- 
odology, and perhaps that literacy, language, and mathematics 
skills may all be influenced by EF skills in the pre-k year but in 
different ways. For example, it could be the case that literacy and 
language skills are more affected by the types of skills tapped by 
teacher reports, namely, how children use their EF skills in classroom
learning, whereas mathematics skills are more directly affected
by the cognitive processes themselves. Future research is
necessary to examine these effects beyond pre-k and into early
elementary school to determine if differential patterns emerge.

The difference in magnitude of effects of EF skills on literacy 
and mathematics gains compared to language gains is not 
necessarily surprising considering the nature of pre-k instruc- 
tion in which literacy and mathematics skills receive more 
explicit attention compared to language skills. Thus, the bene- 
fits of having greater EF skills that allow children to attend and 
remain engaged during classroom instruction may be more 
relevant for early literacy and mathematics skills compared to 
language skills. Also, in this particular data set, children made 
substantially smaller gains in language across the pre-k year, and 
the correlations from pre- to posttest were very high, suggesting 
strong stability and less intraindividual variability to be ex- 
plained by EF measures. 

It is worth commenting on the fact that we did not find HTKS 
to be a unique predictor of gains in any of the three academic 
areas, which is in contrast to several recent studies with this 
measure (e.g., McClelland et al., 2007; Wanless et al., 2011). A 
big difference in our study compared to the ones cited above is 
that we used several direct assessment measures and not just 
HTKS alone. It is apparent from Table 2 and the zero-order 
correlations that Peg Tapping and HTKS were the most highly 
correlated among the direct assessment measures (r = .52). Peg 
Tapping generally correlated more highly with the other direct 
assessments than HTKS did. It is possible that HTKS alone is 
an important contributor to achievement gains, but its contri- 
bution was swamped by the stronger relations between gains 
and some of the other EF direct assessment measures. 

Taken together, these findings illustrate the potential for eco- 
logically valid teacher ratings of EF skills in combination with 
direct assessments, especially with regard to examining the inter- 
relations between EF skills and academic achievement in field- 
based research. Importantly, the same pattern of associations was 
not observed for teacher reports of children’s social skills, sug- 
gesting that teachers were not simply more positively predisposed 
to some children than others and, as Duncan et al. (2007) also 
found, that social skills may be important but perhaps not for 
academic achievement. Teachers seem to be able to recognize 
specific types of behaviors in young children that will facilitate 
their learning in the classroom over the year. Researchers will 
sometimes use a battery of direct assessments when measuring the 
various aspects of children’s EF but seldom does the battery 
include teacher reports. Such batteries can require taking children 
out of their classrooms for one-on-one testing for up to 45 min, 
which is not always desirable or feasible. Moreover, because 
traditional direct assessments of EF were initially developed for 
use in neurological research, many assessments of EF are not 
situated within the context of typical preschool learning activities. 
Teacher reports that are easily administered and ecologically valid 
could considerably enhance our understanding of the interrelations 
between the development of EF and early academic success, 
particularly when used in conjunction with direct child assessments.


Limitations 


The contributions of this research notwithstanding, study 
limitations must be acknowledged. While the current findings 
help us better understand the unique value of various modes of 
assessing young children’s EF skills, the correlational design of 
the study does not permit causal conclusions regarding the 
associations between children’s EF skills and gains in achieve- 
ment over the pre-k year. In particular, while we demonstrated 
that teacher ratings were convergent with traditional direct 
assessments of EF and explained unique variance in children’s 
literacy, language, and mathematics achievement in conjunction 
with direct assessments of EF, it is possible that teacher ratings 
could be capturing other aspects of children’s scholastic abilities.
Future research could further validate the use of teacher ratings
by including statistical controls for other aspects of children’s
scholastic abilities that might be confounded with ratings of EF
skills, including general intelligence. Another limitation with the
use of the Work-Related Skills subscale of the Cooper-Farran
Behavioral Rating Scale is that it has three items that reference
emotion or social context. Thus, it is possible that although the
measure primarily assesses EF, it may also capture other skills.
Future work should examine associations between the
Work-Related Skills subscale and other assessments of emotion 
regulation to get a clearer picture of the extent to which the 
Work-Related Skills subscale might also capture more affec- 
tively laden components of self-regulation skills. 


Conclusions 


Both EF direct assessments and EF teacher reports explained 
unique variance in children’s academic achievement gains in 
literacy and mathematics. Teacher reports may be an ecologi- 
cally valid and efficient way of capturing children’s EF devel- 
opment in early childhood classrooms when direct assessments 
are not feasible. If possible, including both teacher ratings and 
one or more direct assessments would seem to be the best 
course. 
Appendix 
Multilevel Regression Equation for Model 4 (See Model 4 in Table 4) 


All multilevel models were run in IBM SPSS (Version 20
Mixed Models) using restricted maximum-likelihood estimation.
Provided below is a sample model equation for the EF direct
assessments and teacher reports entered simultaneously to predict
language outcomes. In the Level 1 equation, the posttest language
score ($Y$) for a child ($i$) who is in classroom ($j$) situated in school
($k$) and system ($l$) is a function of the intercept of the mean
language score ($\beta_{0jkl}$) and the fixed effects associated with a vector
of demographic covariates, including pretest language score,
interval of time elapsed between pretest and posttest, age at pretest,
gender, and individualized education plan status ($\sum \beta_{1jkl}$). A child’s
posttest language score is also a function of the child’s
teacher-reported EF skills ($\beta_{2jkl}$), the individual child direct assessments of
EF skills ($\beta_{3jkl}$, $\beta_{4jkl}$, $\beta_{5jkl}$, $\beta_{6jkl}$, $\beta_{7jkl}$, and $\beta_{8jkl}$), and the Level 1
random effect of the mean language score for children in each
classroom ($\varepsilon_{ijkl}$). In the Level 2 equation, the intercept is a function
of the classroom mean language score ($\gamma_{00kl}$), the fixed effects of
the Level 1 predictors ($\gamma_{1000} \ldots \gamma_{8000}$), and the Level 2 random
effect associated with the intercept ($\eta_{0jkl}$). At Level 3, the intercept
is a function of the school mean language score ($\pi_{000l}$), the
experimental condition assignment of the school ($\pi_{001l}$), and the
Level 3 random effect associated with the intercept ($\varepsilon_{00kl}$). Last, at
Level 4, the intercept is a function of the system mean language
score ($\Gamma_{0000}$), the Level 3 experimental condition ($\Gamma_{0010}$), and the
Level 4 random effect associated with the intercept ($\omega_{000l}$).

Level 1 (child level):

\[
\begin{aligned}
Y_{ijkl} = {} & \beta_{0jkl} + \sum \beta_{1jkl}(\text{Covariates}) \\
& + \beta_{2jkl}(\text{TEACHER REPORT}) + \beta_{3jkl}(\text{CORSI FORWARD}) \\
& + \beta_{4jkl}(\text{CORSI BACKWARD}) + \beta_{5jkl}(\text{DCCS}) + \beta_{6jkl}(\text{HTKS}) \\
& + \beta_{7jkl}(\text{PEG TAPPING}) + \beta_{8jkl}(\text{COPY DESIGN}) + \varepsilon_{ijkl}
\end{aligned}
\]

Level 2 (classroom level):

\[
\begin{aligned}
\beta_{0jkl} &= \gamma_{00kl} + \eta_{0jkl} \\
\beta_{1jkl} &= \gamma_{1000} \\
\beta_{2jkl} &= \gamma_{2000} \\
\beta_{3jkl} &= \gamma_{3000} \\
\beta_{4jkl} &= \gamma_{4000} \\
\beta_{5jkl} &= \gamma_{5000} \\
\beta_{6jkl} &= \gamma_{6000} \\
\beta_{7jkl} &= \gamma_{7000} \\
\beta_{8jkl} &= \gamma_{8000}
\end{aligned}
\]

Level 3 (school level):

\[
\gamma_{00kl} = \pi_{000l} + \pi_{001l}(\text{CONDITION}) + \varepsilon_{00kl}
\]

Level 4 (system level):

\[
\begin{aligned}
\pi_{000l} &= \Gamma_{0000} + \omega_{000l} \\
\pi_{001l} &= \Gamma_{0010}
\end{aligned}
\]
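
Substituting the Level 2 through Level 4 equations into the Level 1 equation gives a single combined form, which makes explicit that only the intercept varies randomly across classrooms, schools, and systems while every predictor has a fixed slope. This is a sketch rather than an equation from the article: the Greek symbols follow one reading of the appendix notation ($\gamma$ for the fixed slopes, $\Gamma$ for the system-level fixed effects, and $\eta$, $\varepsilon_{00kl}$, $\omega$ for the classroom-, school-, and system-level random intercept terms) and should be treated as assumptions.

```latex
\[
\begin{aligned}
Y_{ijkl} = {} & \Gamma_{0000} + \Gamma_{0010}(\text{CONDITION})
  + \sum \gamma_{1000}(\text{Covariates}) \\
& + \gamma_{2000}(\text{TEACHER REPORT}) + \gamma_{3000}(\text{CORSI FORWARD})
  + \gamma_{4000}(\text{CORSI BACKWARD}) \\
& + \gamma_{5000}(\text{DCCS}) + \gamma_{6000}(\text{HTKS})
  + \gamma_{7000}(\text{PEG TAPPING}) + \gamma_{8000}(\text{COPY DESIGN}) \\
& + \omega_{000l} + \varepsilon_{00kl} + \eta_{0jkl} + \varepsilon_{ijkl}
\end{aligned}
\]
```

The first three lines collect the fixed effects; the last line collects the four random terms, one per level of nesting (system, school, classroom, child).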
The Effect of Training and Consultation Condition on Teachers’ 
Self-Reported Likelihood of Adoption of a Daily Report Card 


Alex S. Holdaway and Julie Sarno Owens 
Ohio University 


Using a within-subjects design and validated vignettes, this study examined the relative effects of four 
training and consultation conditions (i.e., consultation with key opinion leaders, consultation with 
observation and performance feedback, consultation with motivational interviewing, and professional 
development-as-usual) on teachers’ (N = 157) self-reported ratings and rankings of the likelihood of 
adoption of a daily report card intervention for students with disruptive behaviors. The consultation with 
key opinion leaders condition produced significantly higher ratings of the likelihood of reported adoption 
than did the consultation with motivational interviewing or professional development-as-usual conditions,
and was ranked higher than all other conditions. Professional development-as-usual was rated and 
ranked significantly lower than all other conditions. Teacher factors, including teacher experience and 
teacher burnout, were evaluated as predictors of adoption ratings. Implications and recommendations 
regarding the use of training and consultation conditions in research and practice are discussed. 


Keywords: teacher consultation, teacher professional development, classroom intervention, daily report card, intervention adoption 


Millions of dollars and hours are spent by school systems each 
year so that teachers are theoretically prepared and proficient to 
use up-to-date, evidence-based instruction, technology, and classroom
management skills. Research on best practices for professional
development (PD) suggests that programs that successfully
produce change in teacher behavior are those that are of sufficient
duration to learn and test new skills, include opportunities for
active learning such as modeling and demonstration, provide
ongoing consultation¹ to support the specific skills being taught,
and have content tailored to the teacher’s specific situation (i.e., 
grade level, occupational responsibilities) (Darling-Hammond, 
Chung Wei, Andree, Richardson, & Orphanos, 2009; Yoon et al., 
2007). Although evidence-based guidelines for PD are available, 
research suggests that the majority of PD programming offered to 
teachers is delivered in a brief workshop format with little or no 
opportunity for skill acquisition or ongoing follow-up (Darling- 
Hammond et al., 2009; Yoon et al., 2007) and is limited in 
changing teacher practices and behavior. Indeed, the results of one 
survey indicated that when asked to consider the last 3 years of PD 
experiences, less than 25% of teachers reported that their PD 
experiences have had an impact on their instruction (Hudson, 
McMahon, & Overstreet, 2002). 
National surveys indicate that few teachers report feeling ade- 
quately trained to manage student disruptive behavior (National 
Comprehensive Center for Teacher Quality, 2012; The New 
Teacher Project, 2013), and elementary school teachers rank class- 
room management as their second greatest area of need for PD, 
behind only instructional skills (Coalition for Psychology in 
Schools and Education, 2006). In one study of over 7,000 educa- 
tors, less than 30% perceived their training in student discipline 
and management as “useful” or “highly useful” (Darling- 
Hammond et al., 2009). Thus, not only are few teachers exposed to 
PD training in classroom management, but the PD literature sug- 
gests that the most frequently used methods are likely insufficient 
to change actual teacher practices (Darling-Hammond et al., 2009; 
National Council on Teacher Quality, 2014; National Research 
Council, 2001). Given these limitations, it may be unrealistic to 
expect teachers to adopt evidence-based behavior management 
interventions without supports that differ substantially from PD- 
as-usual (i.e., a workshop with no follow-up support). Because 
teacher-reported intentions to adopt a practice are moderately 
correlated with actual behavior (Armitage & Conner, 2001) and 
teacher perceptions of interventions have been shown to be posi- 
tively associated with sustained use of the intervention (Baker, 
Kupersmidt, Voegler-Lee, Arnold, & Willoughby, 2010), examin- 
ing teacher-reported intentions to adopt an intervention, given 
specific types of training and consultation, may be an effective 
way for researchers and administrators invested in choosing
intervention training and consultation supports to help forecast how
teachers will respond when provided with differing types of
intervention supports.


¹We use the term consultation throughout this document to describe
individual meetings between a teacher and another individual tasked with
helping improve implementation of an intervention. However, it is
important to note that the term coaching could also have been used, as clearly
differentiated definitions of consultation and coaching have not been well
articulated in the literature (Denton & Hasbrouck, 2009).

Using a within-subjects design and validated vignettes, we examined
the relative effects of three enhanced training and consultation
conditions (hereafter referred to as TCCs) in comparison to 
a PD-as-usual condition on teachers’ perceptions of the likelihood 
of adopting an evidence-based behavior management intervention. 
Specifically, the study addresses the following questions: (a) Do 
enhanced TCCs produce higher ratings and rankings of likelihood 
of adoption of an evidence-based classroom intervention as com- 
pared with PD-as-usual? (b) Do specific, teacher-level factors 
predict ratings of adoption in the context of each TCC? Below, a 
critical review of promising TCCs as well as factors that may 
impact teachers’ likelihood of intervention adoption is provided. 


Teacher Adoption of Interventions 


In the field of implementation science, the study of the transla- 
tion of research to practice, several frameworks have been pro- 
posed to convey the phases that individuals and organizations go 
through when faced with the decision to adopt and implement a 
new intervention (e.g., Aarons, Hurlburt, & Horwitz, 2011; Fixsen, 
Blase, Naoom, & Wallace, 2009). Across the various frameworks, 
adoption represents the first phase and is followed by the imple- 
mentation and sustainment phases. Although exposure to and 
experimentation with a given intervention represents the adoption 
phase, adoption is often conceptualized as a “one-time event” that 
an individual or organization makes, that is, a decision to adopt or 
reject the intervention (Aarons et al., 2011, p. 9). Implementation 
refers to the active use of an intervention once it has been adopted. 
Acceptability refers to whether an intervention is viewed by stake- 
holders (e.g., teachers) as appropriate for the problem, fair, and 
reasonable (Kazdin, 1980). Theoretically, interventions with 
higher acceptability ratings are more likely to be adopted, imple- 
mented with high integrity, and sustained than interventions with 
lower acceptability ratings. Although intervention adoption and 
implementation are complex processes, these definitions offer a 
useful delineation for the purpose of this study. 

The authors of the implementation models have also identified 
factors theorized to enhance or interfere with adoption and imple- 
mentation, including characteristics of the intervention, the teacher 
(or implementer), and the school building or district. Although the 
three phases can overlap and are not limited to a linear sequence 
(Fixsen et al., 2009), adoption may be particularly important to 
evaluate for two reasons: (a) The theoretical models suggest that 
by understanding factors that positively influence adoption, school 
systems can increase efficiencies in the dissemination of interven- 
tion to children in need and (b) such enhanced efficiencies may 
reduce costs and wasted resources, and potentially increase the 
likelihood of continued implementation and sustainment. 

However, though theoretical models have emerged that describe 
the potential importance of the adoption process, few studies have 
examined the effect of PD supports on perceptions of the “adopt- 
ability” of the intervention. In one example, researchers examining 
teacher participation in a universal prevention program designed to 
reduce problem behavior and enhance school readiness in pre- 
schoolers found that teachers’ concerns prior to any training or 
implementation were negatively associated with actual participation
(i.e., implementation of the intervention) throughout the study
(Baker et al., 2010). Further, teacher characteristics, including 
perceptions of professional support and occupational satisfaction, 
were positively related to participation. Therefore, it is possible 
that by either (a) identifying or adapting an intervention such that 
the intervention results in fewer preadoption teacher concerns or 
(b) better identifying those teachers who may be in need of extra 
resources or supports to adopt and implement interventions, 
teacher adoption and implementation rates might be enhanced,
wasted resources and costs significantly reduced, and efficiencies
in extending the reach of interventions to students increased. 
However, as noted by implementation science theorists, empirical 
study of the posited conceptual models is in its infancy and, to 
date, has largely been focused on the implementation phase rather 
than the adoption phase (Fixsen et al., 2009). Because the adoption 
phase sets the stage for the remaining processes, research on the 
adoption of interventions by teachers could have a significant 
impact on the development of, and costs associated with, teacher 
PD. Below, enhanced models of teacher PD that may have a 
positive effect on teachers’ adoption of evidence-based behavior 
management interventions are reviewed. 


Consultation With Key Opinion Leaders 


One proposed mechanism for increasing adoption of interven- 
tions is the use of key opinion leaders (Atkins et al., 2008). Key 
opinion leader procedures emphasize the use of respected peer 
teachers as intervention advocates and consultants to disseminate 
intervention components. This approach theoretically capitalizes 
on the advantages in contact, respect, and trust that a key opinion 
leader teacher possesses over an external consultant to improve 
intervention adoption and implementation. In published trials, key 
opinion leaders have held a role similar to an informal consultant. 
Specifically, key opinion leaders speak with peers about which 
techniques have worked in their classroom, discuss barriers and 
help problem-solve implementation challenges, but do not engage 
in specific observations or skill-based practices (Atkins et al., 
2008). Further, key opinion leader support has been conceptual- 
ized as consultation on an as-needed basis, as opposed to regularly 
scheduled consultation meetings. In the only published reports of 
key opinion leader influence on teacher adoption of a classroom 
intervention, researchers found that teachers self-reported greater 
adoption and implementation of intervention strategies when 
working jointly with key opinion leaders and mental health con- 
sultants than when working with mental health consultant support 
alone. However, this initial advantage weakened over time such 
that the groups did not differ by the end of the trial’s second year 
(Atkins et al., 2008). The diminishing effect of key opinion leader 
influence over time may indicate that key opinion leader influence 
is a useful tool to promote initial adoption of an intervention, but 
may be relatively less effective as a means of ongoing implemen- 
tation support. 


Consultation With Observation and 
Performance Feedback 


Some teachers may elect not to adopt an intervention due to 
limited skill development in behavior management, experience 
with the intervention, or the opportunity to receive guidance about 
skill implementation. Performance feedback typically involves
observations of the teacher’s implementation, followed by a review
of data from the observation in graphic and/or written form, 
highlighting the teacher’s strengths and areas for improvement 
with the intervention components (Codding et al., 2005). Perfor- 
mance feedback consultation meetings are scheduled on a regular 
basis and occur over several weeks or months, until high-quality 
implementation is achieved. Although some inconsistencies exist, 
studies provide compelling data that performance feedback pro- 
duces higher levels of integrity to intervention procedures relative 
to baseline conditions or alternative strategies, that integrity de- 
clines precipitously in the absence of performance feedback, and 
that performance feedback can be delivered in a manner that is 
acceptable to teachers (e.g., Codding et al., 2005; Noell et al., 
1997). Compared with other options (e.g., key opinion leader and 
motivational interviewing), performance feedback has a compar- 
atively strong research base, with a large number of single-case 
studies and a handful of randomized controlled trials documenting 
its effectiveness in increasing implementation integrity (Solomon, 
Klein, & Politylo, 2012). Although these findings are encouraging, 
they yield little information about the likelihood that a teacher may 
elect to adopt an intervention for use in his or her classroom if 
intervention support includes performance feedback. Nonetheless, 
if teachers know that they will receive such ongoing support, it 
may address preintervention concerns that affect their decision to 
adopt an intervention. 


Consultation With Motivational Interviewing 


A third approach that is effective in improving adoption behaviors
in a number of health and mental health contexts is motivational
interviewing (Miller & Rollnick, 2013). Although motivational
interviewing has primarily been studied as an approach to
improve the motivation of substance users and clinical populations 
to change health behaviors, consultation with motivational inter- 
viewing has recently been adapted as a tool to increase teacher 
adoption and integrity to intervention procedures (Frey et al., 
2013; Gueldner & Merrell, 2011; Reinke et al., 2012). Though 
differences exist in conceptualizations of motivational interview- 
ing in the school context, consultation with motivational interview- 
ing typically includes having the teacher meet one-on-one with a 
consultant to (a) explore the teacher’s values, (b) assess current 
practices via an interview and/or classroom observation, and (c) 
provide feedback on the teacher’s current practices with exploration of
teacher perceptions of evidence-based strategies that may address 
areas of concern (see Frey et al., 2013, for a helpful guide). 
Throughout this short-term process, consultants attempt to enhance 
initial motivation to adopt the intervention by connecting teacher 
values with the intervention (e.g., “I know you’ve told me before 
that you really value independence in your students. How might 
this intervention impact their independence?”) and eliciting 
“change talk” by specifically highlighting the discrepancy between 
the teacher’s values and the status quo (e.g., “Achievement is 
important to you. You also have mentioned to me that you have a 
number of students with behavior issues that aren’t doing well
academically. I wonder if this intervention could help them?”). 
Another important distinction is that, as compared with key opin- 
ion leaders and performance feedback, motivational interviewing 
does not necessarily include ongoing support throughout the year, 
although such procedures could be used in combination (e.g., 


Reinke et al., 2012). Though studies that use motivational inter- 
viewing as a theoretical guide for teacher consultation are growing 
(e.g., Frey et al., 2013; Gueldner & Merrell, 2011; Reinke, Her- 
man, & Sprick, 2011), there is still limited evidence of the impact 
of motivational interviewing on teacher adoption of interventions, 
specifically. As motivational interviewing is intended to directly 
affect teacher engagement and “buy-in,” data as to how teachers
perceive consultation with motivational interviewing as compared 
with other TCCs are a critical step toward establishing its utility in 
enhancing intervention adoption. 


Teacher Factors Associated With Adoption 


Because there may not be a uniform preference for a single TCC 
across all teachers, identifying factors that predict adoption could 
have implications for differential consultation programming. 
Namely, consultation could be effectively tailored to the needs and 
characteristics of different subsets of teachers within a building or 
district. Predictors of adoption could also help to identify those 
teachers who may need additional encouragement or individual- 
ized attention if they are identified as less likely to adopt an 
intervention. Thus, factors that have been shown to be associated 
with the adoption and implementation of interventions were also 
examined as potential predictors of adoption. 


Teacher Experience 


New teachers often feel unprepared for the classroom behavior 
challenges that arise when educating students with disruptive 
behavior and consistently rate classroom behavior as a top reason 
for leaving the profession (Ingersoll, 2001). In a sample of over 
2,000 educators, 52% of first-year teachers ranked classroom man- 
agement as their number one PD need, with only 10% of teachers 
with over 10 years of experience ranking it as their number one PD 
need (Coalition for Psychology in Schools and Education, 2006). 
Although this finding suggests differences between teachers’ pri- 
orities related to levels of experience, the impact of specific types 
of TCC approaches on teachers with differing levels of experience 
is unclear. The various components of teacher training and con- 
sultation programs such as working with peers (key opinion 
leader) or receiving feedback from outside observers (performance 
feedback) may be received differently by teachers who have a 
depth of experience and are perceived as school leaders, as com- 
pared with new or less experienced teachers. 


Teacher Burnout 


According to Maslach and colleagues (1996), burnout is char- 
acterized by emotional exhaustion, disengagement, and a low 
sense of personal accomplishment. Studies of teacher burnout have 
revealed that student misbehavior is a significant predictor of 
teacher burnout (Hastings & Bham, 2003). Further, teachers with 
higher levels of burnout were found to endorse more negative 
attitudes about implementing a new academic program than col- 
leagues with lower levels of burnout (Evers et al., 2002). Thus, it 
is possible that this finding would generalize to teachers’ attitudes 
about implementing a new intervention targeting student misbe- 
havior. 


Self-Efficacy


Teacher self-efficacy is a teacher’s belief that he or she can 
perform a classroom procedure with a high degree of competence. 
A vignette-based study has shown that teachers high in self- 
efficacy rated consultation as being more effective and acceptable 
than teachers low in self-efficacy (DeForest & Hughes, 1992). 
Studies examining the relationship between self-efficacy and ac- 
tual intervention adoption document higher rates of adoption by 
teachers with higher self-efficacy than by teachers with lower 
self-efficacy (Rimm-Kaufman & Sawyer, 2004). This study ex- 
pands the exploration of teacher self-efficacy into the TCC do- 
main. 


Use of Behavioral and Instructional Strategies 


Many interventions for children with disruptive behaviors rely 
on behavioral principles such as contingency management and 
reinforcement. If teachers are already using behavioral principles 
and reinforcement in their classroom, it may be that teachers will 
be more likely to adopt a behaviorally based intervention, as it 
aligns with their current practices. As such, a measure of teachers’ 
use of behavioral and instructional strategies was included.


Principal Support 


Principals are commonly the individuals tasked with being the 
“gatekeepers” for new programs introduced at the school (Hal- 
linger & Heck, 1996), with substantial influence over time spent, 
resource allocation, and incentives for adoption and implementa- 
tion. One study shows that in the dissemination of an evidence- 
based intervention for youth at risk for delinquency, positive 
intervention outcomes occurred only in those schools that had both 
high principal support and a high degree of classroom implemen- 
tation by teachers (Kam, Greenberg, & Walls, 2003). However, no 
studies have specifically examined the effects of principal support 
on teachers’ perceptions of intervention adoption for students with 
disruptive behavior. 


Interventions for Students With Disruptive Behavior 


The above literature review highlights the TCCs that may en- 
hance adoption of an intervention relative to PD-as-usual and 
factors that may impact teachers’ likelihood of intervention adop- 
tion. Now attention is turned to a specific evidence-based behav- 
ioral intervention that teachers could adopt, namely, the daily 
report card (DRC). The DRC is a well-established intervention 
with strong empirical support for effectiveness in both general and 
special education settings (Kelley, 1990; Owens et al., 2012; 
Vannest, Davis, Davis, Mason, & Burke, 2010). The DRC is a tool 
to monitor and modify clearly defined target behaviors (e.g., 
interruptions, work completion) by setting daily goals for the 
student (e.g., seven or fewer interruptions per day), providing 
feedback to the student, and providing rewards for attaining daily 
goals (Kelley, 1990). Despite teachers’ reported acceptability of 
the DRC (e.g., Girio & Owens, 2009), studies indicate that, outside 
of research-based programs, such interventions are underused 
(Martinussen, Tannock, & Chaban, 2011). By examining teacher 
preferences for, and perceptions of, enhanced TCCs, a better 
understanding of the types of PD that may lead to higher rates of 


adoption and use of strategies that effectively address disruptive
student behavior may be achieved. Given that educating students
with disruptive behavior has been linked to increased teacher
stress and negative teacher-student interactions and has been identified
as a major contributor to teacher turnover and job dissatisfaction
(Brouwers & Tomic, 2000; Greene, Beszterczey, Katzenstein,
Park, & Goring, 2002; Ingersoll, 2001), understanding how to
facilitate teachers’ adoption and use of an intervention that improves
the behavior of students with disruptive behavior, while
reducing teacher stress and school expenditures, is an important
priority.


The Current Study 


This study addresses limitations in the teacher PD literature by 
(a) simultaneously examining the effects of multiple TCCs on 
reported adoption likelihood ratings, (b) examining the effects of 
TCCs on adoption rather than implementation or sustainability, 
and (c) concurrently examining multiple teacher factors to identify 
predictors of adoption likelihood ratings. These advancements 
allow researchers and school-based administrators to compare 
teacher perceptions of each TCC, identify unique predictors of 
reported adoption decisions, and disentangle the effects of the 
intervention itself from the TCC provided.²

The first aim was to answer the questions: Do enhanced TCCs 
produce higher ratings and rankings of likelihood of adoption of an 
evidence-based classroom intervention as compared with PD-as- 
usual? If so, which TCC do teachers prefer and rate as most likely 
to result in intervention adoption? Given teacher reports of the 
current state of PD (Hudson et al., 2002), we hypothesized that key 
opinion leader, performance feedback, and motivational interview- 
ing TCCs would be preferable to the PD-as-usual condition. The 
second aim was to answer the question: Do specific, teacher-level 
factors predict ratings of adoption in the context of each TCC? 
Given that multiple factors have not previously been examined 
simultaneously, hypotheses about the relative predictive utility of 
each factor were not made. 


Method 


Participants 


An a priori power analysis, using G*Power, version 3.1.2 (Faul, 
Erdfelder, Lang, & Buchner, 2007), was performed to determine 
an appropriate sample size for the analyses, with priority given to 
the repeated measures analysis of variance (ANOVA) for the first 
aim. Input parameters were conservatively selected (effect size =
.1, α = .05, 1 − β = .8) so that small between-condition differences
could be detected. It was determined that 138 individuals
would be adequate to achieve the desired power for the analyses 
associated with the first aim. 

Participants were 157 teachers (87.9% female; 96.8% Cauca- 
sian; 61.1% with a master’s degree; 22.6% with a special educa- 
tion certification) from eight schools in three school districts in 


² We could have examined various combinations of the TCCs, as teachers
may have access to more than one at a time; however, understanding the
impact of individual TCCs was considered to be the most parsimonious and 
prudent at this stage of the research. 


226 HOLDAWAY AND OWENS 


southeastern Ohio, selected on the basis of geographic proximity 
and administrator amenability to study procedures (i.e., a sample 
of convenience). Years of experience ranged from 1 to 43 (M =
17.83, SD = 14.71), and the average teacher age was 43.24 (SD = 
11.21). Participating schools serve approximately 2,750 students 
(97.4% Caucasian) in grades pre-K through Grade 6, of which 
54%-78% of students were eligible for free and reduced-price 
lunches (U.S. Department of Education, 2009) and 15.2% had a 
reported disability. For six schools, study measures were collected 
prior to an in-service training in August 2011. For the remaining 
two schools, study measures were collected in a group format 
during faculty meetings in January and February 2012. The overall 
teacher response rate was 85.8%. Individual school response rates 
ranged from 83.7% to 87.7%. 


Measures 


Demographic data and principal support. Teachers re- 
ported years of teaching experience and perceptions of principal 
support. Using a 4-point scale ranging from 1 (not supportive) to 4
(very supportive), teachers rated how supportive the principal was
of (a) general use of classroom interventions for children with 
disruptive behavior and (b) the teacher, personally, using class- 
room interventions for children with disruptive behavior. 

Use of behavioral and instructional approaches. The Instructional
and Behavior Management Approaches Survey (Martinussen
et al., 2011) is a 39-item scale that measures teachers’
self-reported frequency of use of instructional and behavior man- 
agement approaches. Items are rated on a 5-point scale ranging 
from 1 (rarely) to 5 (most of the time). The survey has two 
subscales: Behavior Management Techniques (19 items) and In- 
structional Approaches (20 items). Because the survey authors 
found the two subscales to be highly correlated (r = .74), and 
previous studies using similar measures have not differentiated the 
two subscales (e.g., Fabiano et al., 2002), a total scale score was 
used in this study. Three items were removed (verbal reprimands, 
remove student from class, and lowering expectations for work) 
due to corrected item-total correlations less than .2. The internal 
consistency coefficient for the 36-item scale was .93. Possible 
scores range from 36 to 180. 
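
The item screening and scale scoring described above can be sketched in a few lines. This is only an illustration under stated assumptions: the response matrix is invented, the function names are hypothetical, and the .2 cutoff is taken from the text. It computes corrected item-total correlations (each item against the sum of the remaining items), Cronbach's alpha, and the possible score range for a summed scale.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(responses):
    """Correlate each item with the sum of all *other* items."""
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        correlations.append(pearson(item, rest))
    return correlations


def cronbach_alpha(responses):
    """Internal consistency from item variances and total-score variance."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def score_range(n_items, low, high):
    """Minimum and maximum possible summed score."""
    return n_items * low, n_items * high


# Hypothetical data: 5 respondents x 4 items rated 1-5 (not study data).
toy = [
    [1, 2, 1, 5],
    [2, 2, 3, 1],
    [3, 4, 3, 4],
    [4, 4, 5, 2],
    [5, 5, 5, 3],
]
item_total = corrected_item_total(toy)
flagged = [j for j, r in enumerate(item_total) if r < 0.2]
print("items below the .2 cutoff:", flagged)
print("range for 36 items rated 1-5:", score_range(36, 1, 5))
```

In this toy matrix only the last item falls below the cutoff, which mirrors the logic used to drop three items from the 39-item survey.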

Teacher self-efficacy. The Ohio State Teacher Efficacy Scale 
(Tschannen-Moran & Hoy, 2001) is a 12-item scale consisting of 
three subscales: Efficacy for Instructional Strategies, Efficacy for 
Classroom Management, and Efficacy for Student Engagement. 
Subscale scores range from 4 to 36; higher scores indicate higher 
efficacy. The measure has adequate convergent validity with mea- 
sures of teacher self-efficacy (Tschannen-Moran & Hoy, 2001) 
and has been found to have a three-factor structure in exploratory 
and confirmatory analyses (Tschannen-Moran & Hoy, 2001). In- 
ternal consistency coefficients in the current study were .91 (total 
scale), .81 (instructional strategies subscale), .89 (classroom man- 
agement subscale), and .84 (student engagement subscale). 

Teacher burnout. The Maslach Burnout Inventory-Educators
Survey (MBI-ES; Maslach & Jackson, 1986) is a 22-item
self-report scale that assesses teacher burnout. The measure
yields a total score and three subscales: Emotional Exhaustion, 
Depersonalization, and Sense of Personal Accomplishment. Items 
are rated on a scale ranging from 0 (never) to 6 (every day). 
Subscale scores range from 0 to 54. The MBI-ES is one of the most 


often used measures of burnout in the field of education and has 
been found to have a consistent three-factor structure across in- 
vestigations of educator burnout (see Worley, Vassar, Wheeler, & 
Barnes, 2008, for a meta-analysis). Studies distinguished burnout 
from job dissatisfaction and established burnout as a distinct 
construct (Leiter & Durup, 1994). Internal reliability coefficients 
were .90 for Emotional Exhaustion, .76 for Depersonalization, and 
.76 for Sense of Personal Accomplishment (Iwanicki & Schwab,
1981). In the current sample, alpha coefficients were .90, .69, and 
.79, for Emotional Exhaustion, Depersonalization, and Sense of 
Personal Accomplishment, respectively. 

Likelihood of intervention adoption. The Intervention Support
Questionnaire (ISQ) is a self-report questionnaire designed
for the current study to assess the likelihood of DRC adoption 
when teachers are provided with different TCCs. The ISQ includes 
vignette descriptions of (a) a child demonstrating moderate levels 
of attention-deficit/hyperactivity disorder symptom severity who is 
moderately disruptive in the classroom; (b) a teacher-led individ- 
ualized intervention (i.e., DRC) that was appropriate and feasible 
to address the child’s difficulties; and (c) descriptions of TCCs that 
reflect consultation with a key opinion leader, consultation with 
performance feedback, consultation with motivational interview- 
ing and PD-as-usual procedures (see ISQ construction and valida- 
tion information in the Appendix). After reading each TCC de- 
scription, participants reported the likelihood of adopting the 
intervention given the TCC provided using a scale that ranged 
from 0 (There is no chance I will use this intervention) to 100 (I 
will definitely use this intervention), with 10-point increments. The 
ISQ also provides space to write a narrative description of what 
aspects of the TCC description influenced the respondent’s likelihood
of adoption ratings (analysis of these data is in progress).
After reading all four TCCs, respondents also were asked to rank 
order the TCCs from 1 (most likely to result in intervention 
adoption) to 4 (least likely to result in intervention adoption) in a 
forced-choice format. 


Data Collection Procedures 


All procedures were reviewed and approved by the university 
Institutional Review Board in an expedited review. Participation 
was voluntary. Teachers received a token of appreciation with 
value less than $10 for participation. Teachers were provided with 
a packet containing a consent form and all questionnaires. Within 
each packet, questionnaire order was counterbalanced. Further, 
using a within-subjects design, the order of vignettes was also 
counterbalanced within the ISQ to address potential ordering ef- 
fects. Consent forms and verbal descriptions (for participants who 
completed forms in person) indicated that no identifiable results
would be communicated with school administrators or principals 
and that the purpose for the study was to examine teachers’ 
preferred intervention supports. Participants were instructed to 
complete all forms and return the packet to research staff. Packets 
were left at the schools for teachers unable to attend the data 
collection meetings and retrieved 1 week later (14.6% of the 
sample). No differences were found between teachers who com- 
pleted the packets in person or independently, between teachers 
from different schools, or between teachers who completed mea- 
sures in summer versus winter. 


Results


Aim 1: Do Enhanced TCCs Produce Higher Reports 
of Likelihood of Intervention Adoption as Compared 
With PD-as-Usual? 


Likelihood of adoption ratings. Teachers’ self-reported likelihood
of intervention adoption ratings were examined using a one-way
repeated measures ANOVA. TCC was a repeated measures
variable because each teacher completed ratings and rankings for
all four TCCs. Omnibus results determined that mean likelihood of
adoption ratings differed significantly as a function of
TCC, F(3, 465) = 58.79, p < .001. The mean likelihood of adoption
ratings for each condition are presented in Table 1.
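
For readers who want to see how the omnibus test partitions variance, the following is a minimal sketch of a one-way repeated measures ANOVA. The data are invented, the function names are hypothetical, and the sketch assumes complete cases and applies no sphericity correction; it is not the software routine used in the study. Under these formulas, error degrees of freedom of 465 with four conditions correspond to 156 complete cases.

```python
def rm_df(n_subjects, n_conditions):
    """Degrees of freedom for the condition effect and its error term."""
    df_cond = n_conditions - 1
    df_error = (n_conditions - 1) * (n_subjects - 1)
    return df_cond, df_error


def rm_anova(scores):
    """One-way repeated measures ANOVA.

    scores: one row per subject, one rating per condition.
    Returns (F, df_condition, df_error).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    # Subject variance is removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond, df_error = rm_df(n, k)
    f_ratio = (ss_cond / df_cond) / (ss_error / df_error)
    return f_ratio, df_cond, df_error


# Hypothetical ratings: 3 subjects x 3 conditions (not study data).
toy = [[1, 2, 3], [2, 3, 5], [3, 5, 6]]
print(rm_anova(toy))
print(rm_df(156, 4))
```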

Post hoc tests of adoption ratings using the Bonferroni correction
and Morris and DeShon’s (2002) equation for within-subject
effect sizes, controlling for dependence, revealed that the key 
opinion leader condition received the highest mean likelihood of 
adoption rating among the TCCs (M = 79.58) and was rated 
significantly higher than the motivational interviewing (M =
68.75; ES = .47, p < .001) and PD-as-usual conditions (M = 56.4; 
ES = .93, p < .001), but not the performance feedback condition 
(M = 75.42; ES = .21, p = .070). The performance feedback 
condition was rated significantly higher than the motivational 
interviewing (ES = .30, p < .01) and PD-as-usual conditions (ES =
.75, p < .001). Motivational interviewing was rated significantly 
higher than the PD-as-usual condition (ES = .52, p < .001). Thus, 
teachers reported a greater likelihood of adoption of a DRC inter- 
vention when presented with the key opinion leader or perfor- 
mance feedback conditions than if presented with motivational 
interviewing or PD-as-usual, and a greater likelihood of adoption 
if presented with motivational interviewing than if presented with 
PD-as-usual. As hypothesized, the key opinion leader, perfor- 
mance feedback, and motivational interviewing TCCs produced 
higher ratings than the PD-as-usual condition (see Table 1). 
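
A within-subject effect size that controls for dependence can be sketched as follows. The text does not state which of Morris and DeShon's (2002) equations was applied, so this example assumes one common reading: the repeated measures d (mean difference divided by the standard deviation of the differences) is rescaled by sqrt(2(1 − r)), where r is the correlation between the two sets of ratings. The function name and data are hypothetical.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def dependence_corrected_d(ratings_a, ratings_b):
    """Assumed form: d_RM * sqrt(2 * (1 - r)) for paired ratings."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    d_rm = mean(diffs) / stdev(diffs)   # repeated measures metric
    r = pearson(ratings_a, ratings_b)   # dependence between conditions
    return d_rm * (2 * (1 - r)) ** 0.5


# Hypothetical 0-100 adoption ratings from five teachers (not study data).
condition_a = [90, 80, 70, 100, 60]
condition_b = [70, 70, 50, 90, 50]
print(round(dependence_corrected_d(condition_a, condition_b), 2))
```

When the two conditions have equal variances, this quantity reduces to the mean difference divided by the common standard deviation, which is why it is comparable to a between-groups d.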

Likelihood of adoption rankings. The average adoption 
rankings for each condition are presented in Table 1. To examine 
teachers’ likelihood of adoption rankings, a Wilcoxon signed-rank 
test was conducted (see Table 1). The key opinion leader condition 
was ranked as the most preferred strategy, significantly more often 
than performance feedback (Z = −3.558, p < .001), motivational





Table 1
Intervention Adoption Likelihood Ratings and Rankings by TCC

Support type                 Mode    M         SD
Ratings
  Key opinion leader         90      79.39a    18.72
  Performance feedback       90      75.25a    22.20
  Motivational interviewing  80      68.63b    23.20
  PD-as-usual                50      56.41c    24.68
Rankings
  Key opinion leader         1       1.78a     0.97
  Performance feedback       2       2.28b     1.03
  Motivational interviewing  3       2.74c     0.92
  PD-as-usual                4       3.19d     1.03


Note. Higher ratings reflect higher likelihood of adoption; lower-numbered
rankings reflect higher likelihood of adoption. Conditions with different subscripts
are significantly different at the p < .05 level. TCC = training and
consultation condition; PD = professional development.


interviewing (Z = −6.821, p < .001), and PD-as-usual
(Z = −7.837, p < .001). The performance feedback condition was
ranked second highest and significantly more often than the motivational
interviewing condition (Z = −3.295, p < .001) and
PD-as-usual condition (Z = −5.986, p < .001). Finally, the
motivational interviewing condition was ranked as the third most
preferred strategy, significantly more often than the PD-as-usual
condition (Z = −3.166, p < .001). Thus, when presented with a
forced-choice ranking format, teachers ranked the key opinion 
leader condition as most likely to result in intervention adoption, 
followed by the performance feedback, motivational interviewing, 
and PD-as-usual conditions. The percentages of each rank that 
each TCC received can be found in Table 2. 
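
The pairwise rank comparisons above can be illustrated with a minimal sketch of the Wilcoxon signed-rank test using the normal approximation with a tie correction. This is an assumption about the form of the statistic; statistical packages differ in which rank sum they report and in continuity corrections, so the sign and exact value of Z may not match a given program. The function name and the rankings are hypothetical.

```python
def wilcoxon_z(x, y):
    """Z statistic for paired samples x and y (normal approximation)."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)

    # Average ranks of the absolute differences, tracking tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    return (w_plus - mean_w) / var_w ** 0.5


# Hypothetical forced-choice ranks (1 = most likely) from six teachers.
ranks_condition_a = [1, 1, 2, 1, 1, 2]
ranks_condition_b = [2, 3, 1, 4, 2, 3]
print(round(wilcoxon_z(ranks_condition_a, ranks_condition_b), 3))
```

Because rank 1 denotes the most preferred condition, a negative Z here indicates that the first condition tended to receive the lower-numbered (more favorable) ranks.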


Aim 2: Do Teacher-Level Factors Predict Ratings of 
Adoption in the Context of Each TCC? 


Ten predictors of likelihood of adoption ratings were examined 
simultaneously in a series of four linear regressions; one regression 
model per TCC. Each regression included an examination of 
variance inflation factors (VIFs) to check for collinearity. No 
variables needed removal, as all VIFs fell below 5.
See Table 3 for descriptive information for predictor variables, 
Table 4 for correlations between predictors and TCC ratings, and 
Table 5 for regression results. Correlations between predictors are 
presented in Table 6. 
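
The collinearity check can be sketched as follows. A predictor's VIF equals the corresponding diagonal element of the inverse of the predictor correlation matrix, so the sketch builds that matrix and inverts it by Gauss-Jordan elimination. The data and function names are hypothetical, and the cutoff of 5 is taken from the text.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """columns: one list of values per predictor. Returns one VIF each."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors (not study data).
predictors = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [5, 3, 4, 1, 2],
]
values = vifs(predictors)
print([round(v, 2) for v in values])
print("any VIF at or above 5:", any(v >= 5 for v in values))
```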

Consultation with key opinion leaders. The total regression 
model for the key opinion leader condition was not significant 
(R² = .092), F(10, 428.88) = 1.32, p = .225. Number of years
employed as a full-time teacher (β = −.205), t(151) = −2.32, p =
.022, was the only significant predictor of key opinion leader 
ratings. When key opinion leader consultation was offered, teach- 
ers who have less experience provided higher likelihood of adop- 
tion ratings than teachers who have more experience. A follow-up 
median split analysis indicated that teachers with less than or
equal to 15.50 years of experience provided higher ratings of
adoption with key opinion leader support (M = 83.82, SD =
16.97) than did teachers with more than 15.50 years of experience
(M = 75.20, SD = 19.42), t(150) = 2.91, p = .004.
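
The median split comparison can be checked approximately from the reported summary statistics. The group sizes are not reported, so the sketch below assumes two equal groups of 76 teachers, which is consistent with 150 degrees of freedom but is only an assumption. It uses a pooled-variance independent-samples t test, and the function name is hypothetical.

```python
def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic from means, SDs, and group sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (m1 - m2) / standard_error, df


# Means and SDs are from the text; n = 76 per group is an assumption.
t_value, df = pooled_t(83.82, 16.97, 76, 75.20, 19.42, 76)
print(round(t_value, 2), df)  # about 2.91 with df = 150
```

Under the equal-groups assumption, the reported means and standard deviations reproduce the reported t value to two decimals.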

Performance feedback. The total regression model for the 
performance feedback condition was not significant (R² = .116),
F(10, 701.72) = 1.70, p = .087. The Sense of Personal Accomplishment
subscale from the MBI-ES (β = .222), t(151) = 2.28,
p = .024, was the only significant predictor. When consultation 
with performance feedback was offered, teachers who had a higher 
sense of personal accomplishment (e.g., “I feel I’m positively 
influencing other people’s lives through my work”) reported
higher likelihood of adoption ratings than teachers who had a 
lower sense of personal accomplishment. The Sense of Personal 
Accomplishment subscale of the MBI-ES yields three categories 
of personal accomplishment: low, moderate, and high. Using a 
one-way ANOVA, average likelihood of adoption rating differ- 
ences were examined between those participants who fell into each 
category. Results indicated significant differences in mean performance
feedback ratings between categories, F(2, 153) = 6.90,
p = .001. Mean ratings were highest for those high in personal
accomplishment (n = 90; M = 79.83, SD = 21.86), followed by 
those with a moderate (n = 46; M = 73.91, SD = 18.91) and low 
Table 2
Distribution of Rankings for Each TCC

Support technique            First    Second   Third    Fourth
Key opinion leader           52.6%    25.0%    14.5%    7.9%
Performance feedback         27.0%    32.9%    25.0%    15.1%
Motivational interviewing    9.2%     30.4%    36.2%    23.7%
PD-as-usual                  11.2%    11.8%    23.7%    53.3%


Note. TCC = training and consultation condition; PD = professional 
development. 


sense of personal accomplishment (n = 20; M = 61.00, SD = 
21.74). 

Consultation with motivational interviewing. There were no
significant predictors in the model for the motivational interview- 
ing condition. 

PD-as-usual. There were no significant predictors in the 
model for PD-as-usual. 


Discussion 


As hypothesized, both ratings and rankings of adoption likeli- 
hood suggest that teacher perceptions of rates of adoption of an 
intervention for students with disruptive behavior may be signifi- 
cantly enhanced if intervention training and support includes con- 
sultation with key opinion leaders, consultation with performance 
feedback, or consultation with motivational interviewing com- 
pared with PD-as-usual. Further, the highest likelihood of inter- 
vention adoption was reported when teachers were provided with 
descriptions of consultation with key opinion leader and perfor- 
mance feedback. In partial support of theories of adoption and 
implementation (Fixsen et al., 2009; Rogers, 2003), some teacher- 
level predictors of reported adoption were identified. Below, im- 
plications for future research and practice are discussed. 


Consultation With Key Opinion Leaders 


According to theories of innovation diffusion (Rogers, 2003), 
key opinion leaders are important determinants of the potential 
spread of an innovation (i.e., intervention) throughout an environment.
Namely, respected, socially central figures in the school
environment are conceptualized as models of behavior for others 
in the social network, and considerable weight is given to their 
opinions and experience. Consistent with this theory, as well as 
other recent studies (Atkins et al., 2008; Cunningham et al., 2009), 
nearly 80% of teachers (see Table 2) reported the key opinion 
leader condition to be the first or second most likely TCC to result 
in DRC adoption, with a mean rating of 79.39 across all teachers. 
Though this figure likely overstates the number of teachers who 
would adopt the intervention if provided a real opportunity, this 
perception may be an important indicator of true behavior and may 
also relate to sustained implementation (i.e., Baker et al., 2010). 

These data, and those of Atkins and colleagues (2008), provide 
substantial evidence that teachers trust the opinions of respected 
colleagues in their network and feel comfortable receiving consul- 
tation and support from them. Indeed, this is the likely natural 
mechanism through which many teachers obtain advice and rec- 
ommendations about both instructional and behavioral interven- 
tions, particularly in underresourced schools that lack consultants 
or school mental health professionals. Given the access that teach- 
ers have to key opinion leaders and the natural transfer of infor- 
mation that occurs within this social network, training key opinion 
leaders in evidence-based programs (in a manner that aligns with 
evidence-based PD models; Darling-Hammond et al., 2009) may 
be an effective catalyst for adoption by others. 

It is important to note that our data and those of Atkins and
colleagues (2008) are based on teacher self-report rather than 
actual observed intervention adoption. Though there is evidence to 
suggest that teacher perceptions of interventions, prior to training, 
relate to teachers’ sustained implementation of interventions, these 
findings must be replicated (Baker et al., 2010). If replicated, 
schools could develop procedures for identifying the key opinion 
leader teachers who are respected for their skills (e.g., in behavior 
management) and develop a process for keeping them current in 
evidence-based practices and disseminating this information 
throughout the network of teachers. It is also important to note that 
the characteristics ascribed to key opinion leaders (e.g., well- 


Table 3
Descriptive Statistics for Support Condition Rating Predictors

Teacher factor                               M        SD      Minimum  Maximum
Experience
  Years of teaching experience               16.03    9.78    1        43
Burnout
  MBI-ES Emotional Exhaustion                20.96    10.67   0        47
  MBI-ES Depersonalization                   4.43     4.45    0        20
  MBI-ES Sense of Personal Accomplishment    38.58    6.48    14       48
Self-efficacy
  OSTES Student Engagement                   6.99     1.16    4        9
  OSTES Instructional Strategies             VES      .99     4        9
  OSTES Classroom Management                 132      1.09    4        9
Classroom management
  IBMAS total score                          123852,  21.46   60       170
Principal support
  Principal support in general               3.40     .82     1        6
  Principal support of individual            3.56     .88     1        6


Note. MBI-ES = Maslach Burnout Inventory-Educators Survey; OSTES = Ohio State Teacher Efficacy
Scale; IBMAS = Instructional and Behavior Management Approaches Survey.
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Correlations Between Teacher Factors and TCC Ratings 
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Experience 
Years of teaching experience 
Burnout 
MBI-ES Emotional Exhaustion 
MBI-ES Depersonalization 
MBI-ES Sense of Personal Accomplishment 
Self-efficacy 
OSTES Student Engagement 
OSTES Instructional Management 
OSTES Classroom Management 
Classroom Management 
IBMAS total 
Principal support 
Principal support of school 
Principal support of individual 


Note. 
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TCC = training and consultation condition; PF = performance feedback; KOL = key opinion leader; 


MI = motivational interviewing; PDAU = professional development-as-usual; MBI-ES = Maslach Burnout 
Inventory—Educators Survey; OSTES = Ohio State Teacher Efficacy Scale; IBMAS = Instructional and 


Behavior Management Approaches Survey. 
p05 ip = 00 


respected, available, experienced, skilled) might also be found in 
other professionals. Future studies may benefit from teasing apart 
what aspects of key opinion leaders are most important for adop- 
tion and to what extent these characteristics appear to be differ- 
entially associated with the individual characteristics of the person 
versus those inherent to the position as a peer teacher. It is also 
important to note that although key opinion leader teachers may 
enhance initial adoption behavior, there is no evidence that their 
impact extends to enhancing implementation integrity once the 
intervention is adopted. Indeed, enhancing integrity may exceed 
the reach, occupational responsibility, and skill set of a key opin- 
ion leader. To date, only intensive observation, coaching, and 
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performance feedback has evidence for enhancing implementation 
integrity (Domitrovich et al., 2008). 


Summary of Regression Analyses for Variables Predicting TCC Ratings 

                                              Performance   Key opinion   Motivational  Professional development-
Teacher factor                                feedback      leader        interviewing  as-usual
Regression model R²                            .116          .092          .076          .099
Standardized β
  Experience
    Years of teaching experience              -.093         -.205*         .059         [illegible]
  Burnout
    MBI-ES Emotional Exhaustion               -.092          .075          .075         -.059
    MBI-ES Depersonalization                   .006         -.060         -.020         -.080
    MBI-ES Sense of Personal Accomplishment    .222*         .147          .190          .151
  Self-efficacy
    OSTES Student Engagement                   .116          .105          .138         [illegible]
    OSTES Instructional Management             .019         [illegible]   [illegible]   -.093
    OSTES Classroom Management                [illegible]   [illegible]   -.082         -.050
  Classroom management
    IBMAS total                                .020          .044          .020          .078
  Principal support
    Principal support in general               .120         -.009          .072         [illegible]
    Principal support of individual           -.192         -.078         -.058          .070

Note. TCC = training and consultation condition; β = standardized beta; MBI-ES = Maslach Burnout Inventory—Educators Survey; OSTES = Ohio 
State Teacher Efficacy Scale; IBMAS = Instructional and Behavior Management Approaches Survey. 


Table 6 
Correlations Between Teacher Factors 


Variable 1 


Consultation With Performance Feedback 


Given the substantial evidence for performance feedback in 
enhancing implementation integrity, it is encouraging to see that 
teachers perceive consultation with performance feedback as more 
likely to result in adoption than procedures that have not been 
shown to result in behavior change (i.e., PD-as-usual). To our 
knowledge, no study has previously reported on teachers’ perceptions 
of performance feedback before teachers had been exposed to 
observation and feedback procedures. Our data offer promise that 
consultation with performance feedback is viewed positively by 
teachers and may enhance initial adoption decisions as compared 
with motivational interviewing and PD-as-usual conditions. Fur- 
ther, combining findings from this study and previous studies of 
performance feedback, examining a combined consultation proto- 
col that includes both performance feedback and key opinion 
leaders may be a fruitful avenue for developing a system that 
impacts both initial adoption decisions and sustained implementa- 
tion of interventions. Future studies could experimentally manip- 
ulate the use of key opinion leader and performance feedback 
procedures alone and in combination and examine the potentially 
differing impacts on teachers’ actual adoption of the DRC and the 
integrity with which it is implemented over time. 

It is interesting to consider the characteristics that separate the 
key opinion leader and performance feedback conditions from the 
motivational interviewing and PD-as-usual conditions that were 
rated and ranked significantly lower in producing adoption likeli- 
hood. The key opinion leader and performance feedback condi- 
tions were the only conditions in which consultation was described 
as being provided throughout the entire year, potentially suggest- 
ing that teachers recognize the need and/or benefits of ongoing 
support, consultation, and feedback, and that such ongoing support 
is an important determinant of teacher adoption decisions. Though 
further studies deconstructing which aspects of each of the TCCs 
are associated with rates of adoption are needed, this hypothesis 
aligns with best practices for PD (Darling-Hammond et al., 2009; 
Yoon et al., 2007). Future research should also examine the con- 
gruency between teacher reports of adoption and actual adoption 
behavior, as this finding may bolster researcher and consumer 
confidence in offering performance feedback as a mechanism for 
enhancing intervention adoption and using performance feedback 
throughout the year to sustain high-quality intervention implemen- 
tation. 

Finally, it is important to note that even though there is sub- 
stantial support for the use of performance feedback, there is not 
agreement on the dosage or length of performance feedback that 
may be necessary or sufficient to produce sustained behavior 
change. Several studies document that brief versions of performance 
feedback can produce positive short-term outcomes. However, 
few studies have examined the long-term maintenance of 
improvements in the quality of implementation, and a precipitous 
decline in quality implementation has been observed once the 
performance feedback is removed in some studies (e.g., Noell et 
al., 1997). Thus, performance feedback across an extended period 
or intensive performance feedback with periodic “boosters” may 
be necessary for sustained implementation. 


Consultation With Motivational Interviewing 


Although motivational interviewing was rated and ranked as 
less likely to result in DRC adoption than key opinion leader or 
performance feedback TCCs, our data suggest that interventions 
with a motivational interviewing component may still be a fruitful 
area for continued study. Indeed, teacher report suggests that 
consultation with motivational interviewing is perceived as signif- 
icantly more likely to result in adoption than PD-as-usual. Re- 
searchers have yet to develop a consensus on how best to incor- 
porate motivational interviewing principles in school-based 
consultation, but a variety of adaptations have been proposed (Frey 
et al., 2013; Reinke et al., 2012), and studies are currently under- 
way to examine the extent to which the use of motivational 
interviewing in consultation enhances intervention adoption and 
implementation (Owens & Coles, 2014; Reinke, Frey, Herman, & 
Thompson, 2014). Given that 10% of teachers ranked consultation 
with motivational interviewing as their first choice of support, this 
TCC may be particularly helpful for a small subset of teachers, 
especially those who may be ambivalent about the intervention. 
Because none of the teacher-level predictors in this study had 
significant utility in identifying this subset of teachers, research 
with different constructs such as attitudes toward evidence-based 
practices or the consultation process is warranted. In addition, 
future research could examine what, if any, incremental effect 
motivational interviewing procedures have, above and beyond 
performance feedback and key opinion leader approaches. 


PD-as-Usual 


As hypothesized, the PD-as-usual condition received the 
lowest rankings and ratings of the likelihood of DRC adoption. 
Although this finding was expected, this outcome is concerning, 
given that this style of PD is the most common method of 
training available to teachers (Darling-Hammond et al., 2009). 
With fewer than 25% of teachers in this study ranking PD-as-usual 
as either first or second most likely to result in intervention 
adoption (Table 2), it would appear that the standard 
training and consultation options are perceived as largely in- 
sufficient, and indeed may explain the relatively low rates of 
adoption of evidence-based classroom interventions for youth 
with disruptive behaviors (Martinussen et al., 2011). 


Predictors of Adoption 


Although teacher-level factors (i.e., experience and personal 
accomplishment) show some promise in identifying teachers who 
are in need of further support or “matching” to a specific TCC, 
teacher-level factors explained a small amount of variance in 
likelihood of adoption ratings. Specifically, across TCC condi- 
tions, only two significant predictors of likelihood of adoption 
ratings emerged. First, results indicated that a higher sense of 
personal accomplishment was linked to higher adoption likelihood 
ratings in the performance feedback condition. This finding may 
indicate that teachers who have higher confidence in their ability 
to influence student outcomes are more comfortable being observed 
and receiving feedback about their performance, and are more 
welcoming of consultation, than those who question their impact on 
student outcomes. Second, more years of teaching experience was 
predictive of lower adoption likelihood ratings in the key opinion 
leader condition. This finding could be explained by a number of 
potential phenomena. First, it is possible that the more senior 
teachers are key opinion leaders themselves. Thus, others turn to 
them for support and advice, but they do not have others in the 
network to whom they turn for support and advice. Second, 
more senior teachers may feel as if they have already accumu- 
lated the knowledge and skills necessary to effectively manage 
disruptive behavior; thus, the presence of a key opinion leader 
may be less influential in their adoption decision. In contrast, 
the presence of a key opinion leader and a supportive collegial 
network may be incrementally important for more junior teach- 
ers, as there is evidence that teacher induction programs and 
collaborative environments are associated with teacher deci- 
sions to remain in the field and occupational satisfaction in 
young teachers (Kardos et al., 2001; Smith & Ingersoll, 2004). 
Thus, consultation with a key opinion leader may be a partic- 
ularly attractive and important training and consultation option 
for new or less experienced educators (Shernoff et al., 2011). 
However, because eight of 10 predictors in the model were not 
significant, and because the significant predictors explained 
only a small portion of the variance in adoption ratings, the 
practical meaning of this finding may be limited. In addition, 
examination of the predictive utility of other teacher-level fac- 
tors (e.g., attitude variables) may be more fruitful. 


Limitations and Future Directions 


First, this study is limited by its use of an analog design. 
Although steps were taken to ensure that all descriptions validly 
represented the intended constructs, it is unknown how these 
findings generalize to actual adoption of classroom interventions. 
However, evidence of a moderate correlation between self-predictions 
and actual behavioral enactment of intentions (Armitage 
& Conner, 2001), and the negative relationship between 
teacher report of concerns and participation in an intervention 
(Baker et al., 2010), offer some confidence that our results gener- 
alize to actual behavior. Second, all questionnaires and vignettes 
were specific to the teacher level. As schools are nested within 
school systems, and are influenced by state and federal education 
policy, future research should explore how these multilevel factors 
influence the adoption decisions about interventions for students 
with disruptive behaviors. 

Third, it may be considered a limitation that TCCs were exam- 
ined in isolation, rather than considering combinations of TCCs, as 
a combination may result in the highest likelihood of teacher 
adoption. Because this was the first study to directly compare 
multiple TCCs, it was important to examine parsimonious models 
first, and evaluate our hypotheses before examining more complex 
combinations. Additionally, this study did not explicitly test the 
impact of verbal consultation as compared with a more active 
“coaching” model of consultation-focused skill development 
through live coaching (e.g., “bug in the ear”) or role-plays. Fourth, 
measurement of teacher-level predictors was potentially limited by 
the limited psychometric strength of measurement instruments 
used. Fifth, our sample was largely homogeneous with regard to 
race, ethnicity, gender, and school type (i.e., rural), limiting the 
generalizability of the findings. Future research should replicate 
the current study with diverse populations. Finally, though efforts 
were taken to validate TCC descriptions, it is possible that the 
writing style and attention given to specific aspects of each TCC 
influenced teacher response. For example, responses may have 
differed had more explicit descriptions of time spent in consulta- 
tion or magnitude of expected behavior change been operationally 
defined. 


Conclusion 


Children with disruptive behaviors have a number of functional 
impairments in the school setting that contribute to poor academic, 
social, and behavioral outcomes; teacher distress; and costly ser- 
vices. Although evidence-based individualized programs that have 
the potential to effectively treat disruptive behavior are available 
(Pelham & Fabiano, 2008), these interventions are seldom adopted 
and used outside of research contexts (Martinussen et al., 2011). 
Thus, for evidence-based classroom interventions to have their 
intended impact, mechanisms for enhancing teachers’ adoption 
and high-quality implementation of such interventions must be 
identified. Results from this study offer new insights for research 
in this area. 
Appendix 


ISQ Construction and Validation Information 


Validation of the Intervention Support 
Questionnaire (ISQ) 


To ensure that vignette descriptions accurately described 
intended constructs and procedures, vignette validation was 
conducted in a two-step procedure. After we constructed pre- 
liminary descriptions of the child with ADHD, the DRC inter- 
vention, and the four TCCs, feedback was obtained from grad- 
uate students and faculty members familiar with classroom 
interventions and teacher consultation for students with disrup- 
tive behaviors. This feedback guided the first iteration of revi- 
sions. During the second step of vignette validation, revised 
vignettes were distributed for review by established researchers 
who specialize in development and evaluation of school-based 
treatments for children with disruptive behaviors (n = 3) and 
educators currently employed in a local school district (n = 4). 
Research experts had over 60 combined years of research 
experience on school-based treatments for children with disruptive 
behaviors, had published over 200 peer-reviewed journal 
articles, and had received over 20 federal grants for researching 
school-based interventions for students with behavior 
problems. Local educators included a school mental-health professional 
with over 20 years of experience and three classroom 
teachers with over 30 years combined experience. All partici- 
pants in the second step of validation were asked to rate: (1) 
how severe the child’s behavior was on a scale from 0 (no 
problem) to 6 (extreme problem), (2) how disruptive the child’s 
behavior would be to a classroom environment on a scale from 
0 (no problem) to 6 (extreme problem), and (3) the extent to 
which the intervention was appropriate for the child’s behav- 
ioral difficulties on a scale from 0 (strongly disagree that this 
is an appropriate intervention) to 6 (strongly agree this is an 
appropriate intervention). Ratings indicated that the vignettes 
validly represent the goals described above. Namely, ratings of 
child severity (M = 4.28, SD = .48) and disruptiveness (M = 
4.42, SD = .53) fell in the intended range, and ratings of the 
intervention indicated that respondents found it to be an appro- 
priate intervention for the child’s difficulties (M = 5.29, SD = 
.76). 

Additionally, to ensure differentiation between TCC descrip- 
tions, researchers, but not educators, were asked to rate the 
extent to which each TCC’s description represented their pro- 
fessional conceptualizations of each of the TCCs on a scale 
from 0 (not at all) to 4 (to a great extent). A dummy condition 
label (“Consultation with Stress Reduction Techniques”) was 
also included so that raters could not simply “match” the 
descriptions to a label when making their ratings. These raters 
reported that the descriptions largely matched their professional 
conceptualizations of the TCC labels, providing further validation 
for the ISQ. More specifically, three of the four TCC descriptions 
received ratings of 4 for the intended label, and one received a 
mean rating of 3.33, indicating that our descriptions represented 
the intended TCC label to a great extent. Further, no TCC 
description was rated greater than 1.0 for any other TCC label, 
indicating that the TCC conditions were well differentiated. 
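
The differentiation criterion described above can be illustrated with a short sketch. The ratings below are invented for illustration and are not the study's data; the description names, the reduced set of labels, and the reading of the 1.0 ceiling as a ceiling on the mean off-label rating are all assumptions made for the example.

```python
from statistics import mean

# Illustrative ratings on the 0-4 scale from three hypothetical raters.
# These numbers are invented for the sketch and are NOT the study's data.
ratings = {
    "Description A": {  # intended label: performance feedback
        "Performance feedback": [4, 4, 4],
        "Key opinion leader": [0, 1, 0],
        "Stress reduction (dummy)": [0, 0, 0],
    },
    "Description B": {  # intended label: key opinion leader
        "Performance feedback": [1, 0, 1],
        "Key opinion leader": [4, 3, 3],
        "Stress reduction (dummy)": [0, 0, 1],
    },
}
intended_label = {
    "Description A": "Performance feedback",
    "Description B": "Key opinion leader",
}


def summarize_differentiation(ratings, intended_label, off_label_ceiling=1.0):
    """For each description, report the mean rating for its intended label,
    the highest mean rating for any other label, and whether that highest
    off-label mean stays at or below the ceiling."""
    summary = {}
    for description, by_label in ratings.items():
        label_means = {label: mean(scores) for label, scores in by_label.items()}
        target = intended_label[description]
        highest_other = max(
            m for label, m in label_means.items() if label != target
        )
        summary[description] = {
            "intended_mean": label_means[target],
            "highest_other_mean": highest_other,
            "well_differentiated": highest_other <= off_label_ceiling,
        }
    return summary


summary = summarize_differentiation(ratings, intended_label)
for description, result in summary.items():
    print(description, result)
```

With three raters, a mean of 3.33 for an intended label corresponds to one rating of 4 and two ratings of 3, which is the pattern used for the hypothetical Description B.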


Intervention Support Questionnaire 


Observation and Performance Feedback Description 


To support you in using the new intervention, the counselor 
agrees to observe your classroom two times per month and have 
meetings with you to discuss how the intervention is going. In these 
meetings she will help you figure out how to best deliver all the 
components of the intervention and troubleshoot problems specific 
to Sam and his response to the intervention. More specifically, she 
will review and graph the data on how Sam is progressing, discuss 
your strengths in the use of the intervention, discuss any ideas for 
improving intervention use and outcomes, and answer any ques- 
tions or concerns you may have. This would be available to you 
throughout the school year. 


Motivational Interviewing Description 


The counselor talks with you to obtain information about Sam’s 
behaviors and discuss the new intervention. In the process, you 
share concerns you have about implementing the intervention. You 
feel that your current schedule is full, and you have too many kids 
in the classroom to be able to provide special attention to any one. 
The counselor facilitates an interview and discussion that helps 
you explore your hesitancies. During this discussion you explain 
your hesitancies (e.g., “I’d like to help Sam, but I am concerned 
about the time this intervention will take”; “I’m not sure I can 
handle one more thing on top of all the things I already do”). You 
then discuss the potential positives of the program: the classroom 
could be calmer, you could get back some of the instruction time 
you currently use to redirect Sam, and he may be less prone to 
arguing with you. Through this process you come to identify 
possible benefits of implementing the intervention and the school 
counselor agrees to be available to answer any more questions you 
may have or help generate ideas about how to deal with potential 
obstacles. 


Key Opinion Leader Description 


Before reading the vignette below, please think of a colleague 
whom you respect and whom you typically turn to for advice 
and/or guidance regarding disruptive student behavior. Insert this 
person’s name in the blank spaces as you read. 


The counselor informs you that ________ tested this intervention 
in his/her classroom last year and will meet with you to discuss the 
intervention. Because you respect this colleague’s opinion and 
experience, you then meet with ________ and discuss the program. 
________ specifically points out strategies and techniques that 
worked particularly well for him/her in the past year and says the 
program helped a child in his/her class. If you decide to use this 
program, ________ will be available informally as needed 
throughout the school year (e.g., before or after school, during 
planning periods) to discuss progress and help you to use the 
intervention. 





Training-as-Usual Description 


The mental health professional says that there is a new program 
which may be helpful for Sam. You are encouraged to attend the 
workshop to receive training. The workshop will be primarily 
PowerPoint based and taught through a two-hour session. The 
session will include a lecture, a role play and a question and 
answer session at the end. You are given a manual that has 
information about the techniques, quotes from teachers about the 
program and some tips for putting it in use. You are also given a 
website which has information, as well as worksheet examples 
available for download. 

1. Now that you have read all the descriptions of consultation 
and support strategies, please rank the strategies in order of your 
preference. Place a “1” next to the support strategy that would 
make it most likely that you would use the intervention. Place a “2” 
next to the support description that would be second most likely, 
and so on. Please rank all four strategies. Assume that you have to 
use support strategies, even if your answer on previous questions 
was that you would never use the intervention for any strategy. 

Rank 

Support Strategy A 
Support Strategy B 
Support Strategy C 
Support Strategy D 
Adequate sleep is essential for child learning. However, school systems may inadvertently be promoting 
sleep deprivation through early school start times. The current study examines the potential implications 
of early school start times for standardized test scores in public elementary schools in Kentucky. 
Associations between early school start time and poorer school performance were observed primarily for 
schools serving few students who qualify for free or reduced-cost lunches. Associations were controlled 
for teacher-student ratio, racial composition, and whether the school was in the Appalachian region. 
Findings support the growing body of research showing that early school start times may influence 
student learning but offer some of the first evidence that this influence may occur for elementary school 
children and depend on school characteristics. 


Keywords: sleep, start time, school performance, free lunch 


Adequate high-quality sleep is important for the daytime func- 
tioning of children (Paavonen et al., 2000). Consequences of 
inadequate sleep include irritability, emotional dysregulation, im- 
pulsivity, difficulties with attention, and poorer cognitive perfor- 
mance (Curcio, Ferrara, & De Gennaro, 2006). It is therefore 
important to understand factors that may hinder child sleep. For 
children, wake times are partially determined by school start times; 
to attend school, children must wake early enough to get ready and 
be transported to the school (Wolfson, Spaulding, Dandrow, & 
Baroni, 2007). By curtailing the sleep period, earlier school start 
times may reduce the amount of sleep children can obtain (Dexter, 
Bijwadia, Schilling, & Applebaugh, 2003) and lead to sleep de- 
privation. Thus, early school start times may indirectly lead to poor 
school performance by causing sleep deprivation (Dworak, Schierl, 
Bruns, & Strüder, 2007). However, a large-scale investigation 
of the potential impact of public school start times on academic 
achievement is lacking, and very little research has examined the 
impact of start times for elementary school students. The purpose 
of the current study is to address these gaps by examining asso- 
ciations between public elementary school start times and school 
performance measures in the public schools of Kentucky. 
Sleep problems have been linked to poor school performance 
and low attendance rates (Sadeh, Gruber, & Raviv, 2003). For 
example, sleep quality and quantity in school children are related 
to declarative and procedural learning (Curcio et al., 2006). Day- 
time sleepiness is associated with executive functioning problems 
such as poor concentration and difficulty focusing attention (Anderson, 
Storfer-Isser, Taylor, Rosen, & Redline, 2009; Buckhalt, 
El-Sheikh, Keller, & Kelly, 2009; El-Sheikh, Buckhalt, Keller, 
Cummings, & Acebo, 2007). Shorter sleep duration is also linked 
to working memory capacity and memory consolidation (Kopasz 
et al., 2010), cognitive abilities that are very important for aca- 
demic performance. A recent meta-analysis of over a century of 
research demonstrated a small but reliable association between 
children’s longer sleep duration and better performance on cognitive 
tasks and higher academic achievement (Astill, Van der Heijden, 
Van IJzendoorn, & Van Someren, 2012). Another recent 
meta-analysis suggests that sleepiness and sleep duration are related 
to child school performance (Dewald, Meijer, Oort, Kerkhof, 
& Bögels, 2010). Further, treatment of child sleep disorders is 
associated with improvements in attention (Chervin et al., 2006). 

Early school start times are a potential cause of child and 
adolescent sleep deprivation because they curtail the sleep period 
(Knutson & Lauderdale, 2009). There are now a number of studies 
documenting the link between early school start times and lower 
sleep amount and daytime sleepiness in adolescents (e.g., Dexter et 
al., 2003; Epstein, Chillag, & Lavie, 1998; Li et al., 2013; Wahl- 
strom, 2002). For example, a change in high school start times 
from 8:25 a.m. to 7:20 a.m. was associated with student sleep 
deprivation and greater daytime sleepiness (Carskadon, Wolfson, 
Acebo, Tzischinsky, & Seifer, 1998). Wolfson et al. (2007) exam- 
ined two middle schools, one starting classes at 7:15 a.m. (School 
E) and one starting at 8:37 a.m. (School L). Adolescents attending 
School E had significantly more daytime sleepiness and reported 
37 fewer minutes of total sleep than adolescents attending School 
L. Further, adolescents attending School E had 4 times more 
tardies than those attending School L. Owens, Belon, and Moss 
(2010) examined a 30-min delay in start time at a private high 
school and observed a 45-min increase in average sleep duration, 
reduced percentage of sleep deprived students, and declines in 
daytime sleepiness. 

Sleep deficits associated with early school start times may 
translate into poor school performance. A 1-hr delay in middle 
school start times (8:30) was associated with improved student 
performance on tests of attention and impulsivity compared to 
students attending school at the regular time (7:30); these improve- 
ments disappeared after the experimental group returned to the 
normal start time (Lufi, Tzischinsky, & Hadar, 2011). When 
schools in Wake County, North Carolina, delayed school start 
times, Edwards (2012) compared student performance on stan- 
dardized tests of math and reading before (1999) and after (2006) 
the delay. A 1-hr delay in middle schools and high schools was 
related to improved test scores on math and reading (Edwards, 
2012). Effects were especially strong for students with lower test 
scores. Notably, this study found no effects of school start times on 
elementary school students’ performance. 

Despite the strengths of these prior research studies, there are 
some notable gaps in research on school start times and academic 
performance. First, the majority of prior studies have been case 
studies or studies of schools in only one school district (although 
see Li et al., 2013, for an exception). This makes it difficult to 
judge the widespread impact of school start times on academic 
performance. It also leads to the second gap in research: There is 
currently little understanding of how school start times relate to 
student performance in schools with differing characteristics. Few 
studies have examined moderators of the association between 
school start times and child or adolescent functioning, and none 
have examined socioeconomic status variables as moderators. Finally, 
research has almost exclusively considered middle and high 
school students. School start times are proposed to be more influential 
for adolescents because of biological changes in sleep-wake 
regulation associated with puberty (Crowley, Acebo, & Carska- 
don, 2007). On the basis of evidence that early school start times 
are harmful for adolescents, some school districts have chosen to 
push middle and high school start times later and make elementary 
school start times earlier to retain staggered busing strategies 
(Kirby, Maggi, & D’Angiulli, 2011). It is therefore critical to 
investigate the impact of early school start times on elementary 
school students. 

The current study addresses these research gaps. We examine 
associations between school start times and average standardized 
test scores for elementary schools in all public school districts in 
Kentucky. We chose not to include middle and high schools in our 
analysis because we found very little variability in middle and high 
school start times in Kentucky. We hypothesize that schools with 
earlier start times will have lower average student test scores and 
poorer school performance. We also examine two school differences 
as moderators of the association between school start time 
and student test scores: county designation as Appalachian and the 
percentage of students receiving free or reduced-cost lunches. 

The Appalachian region includes the vast majority of eastern 
Kentucky. Appalachian counties are known for their low economic 
status, including high poverty levels and very few job opportunities 
(DeYoung, 1985). Although the Appalachian region has been 
improving in terms of academic performance and employment 
rates, it still lags behind non-Appalachian areas (Shaw, DeYoung, 
& Rademacher, 2004; Wilson & Gore, 2009). For example, Appalachian 
counties have high school dropout rates that are double 
the national average (Laird, Cataldi, KewalRamani, & Chapman, 
2008), giving the region the lowest completion rates in the United 
States (Ziliak, 2012). Because Appalachian schools experience 
greater problems, they may be especially susceptible to the possi- 
ble effects of early school start times. We therefore hypothesize 
that associations between school start times and student test scores 
will be stronger for Appalachian school districts. 

School start times may also have an important impact in schools 
serving economically disadvantaged populations. There is a well- 
documented achievement gap between poor and middle class 
students, and this gap has been steadily increasing over the last 70 
years (H. F. Ladd, 2012). There are likely numerous reasons for 
this gap, including poorer student health, less access to high 
quality preschools, residential mobility or lack of mobility (e.g., it 
may be difficult for poor parents to move into areas with high 
quality schools), and the inability to afford expensive extracurric- 
ular activities that enhance cognitive development (Evans, 2004). 
Sleep may therefore be especially important for economically 
disadvantaged students (Buckhalt, 2011). A common indicator of 
poverty is eligibility for free or reduced-cost school lunch. We 
hypothesize that the association between school start times and test 
scores will be stronger for those schools with a higher percentage 
of students receiving free or reduced-cost lunches. 


Method 


Data were collected for all eligible public elementary schools in 
Kentucky. Schools were considered ineligible if they were voca- 
tional schools, alternative schools, schools that only included 
prekindergarten through the second grade (test data are not avail- 
able for these grades), private schools, special education schools, 
and schools in juvenile justice centers. Two elementary schools 
were removed from analyses because their start time was 1:40 p.m. 
We were unable to determine the start time for one elementary 
school. The resulting sample included 718 elementary schools. 

School start time data were collected via school websites or by
calling the school office. Other variables were obtained via the
Kentucky Department of Education website (http://education.ky.gov).
Variables included in the study are listed below. Data are
from the 2011-2012 school year (Kentucky Department of Education,
2011, 2012). Means and standard deviations are provided in
Table 1.

Table 1

Variable                   Elementary M (SD)

Start time                 8:05 AM (35 min)
  Minimum                  7:00 AM
  Maximum                  9:10 AM
Schools starting at:
  7:00-7:19                1 (0.1%)
  7:20-7:59                350 (48.7%)
  8:00-8:29                224 (31.2%)
  8:30-8:59                41 (5.7%)
  9:00-9:10                102 (14.2%)
NAPD Language              66.24 (17.85)
NAPD Reading               62.01 (13.36)
NAPD Math                  60.45 (13.40)
NAPD Writing               56.78 (12.46)
NAPD Science               88.58 (13.21)
NAPD Social Studies        78.47 (14.96)
Attendance rate            95.20 (1.23)
Retention rate             0.437 (.949)
Graduation rate
College transition rate
Student-teacher ratio      15.30 (2.11)

School start times. Start times were computed as minutes
since midnight.

Novice, Apprentice, Proficient, Distinguished (NAPD)
scores. Each school had scores evaluating student performance
on the Kentucky Performance Rating for Educational Progress
(K-PREP) assessment in each of the following domains: reading,
mathematics, science, social studies, and writing. These scores are
referred to as NAPD scores because they were based on the
percentages of children classified as novice, apprentice, proficient,
and distinguished, based on cutoff scores (see http://www.education.ky.gov
for details). K-PREP exams were administered in third and
fourth grades. The possible range of the K-PREP scores was 0-30 for
each subject at each grade, but cutoff scores differed by subject and
grade. Table 2 presents details regarding cutoffs for classifications and
grades in which the tests were administered. NAPD scores were
computed as follows: Schools received 1 point for every percentage
point of students scoring proficient or distinguished (for a maximum
score of 100); half a point was awarded for each percentage point of
students scoring apprentice. NAPD scores are therefore continuous,
and higher scores represent better school performance. 
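
The scoring rule just described can be illustrated with a short sketch. The function name `napd_score` and the example percentages are hypothetical and are not taken from the Kentucky Department of Education; only the weighting (1 point per percentage point proficient or distinguished, half a point per percentage point apprentice, none for novice) comes from the description above.

```python
def napd_score(pct_novice, pct_apprentice, pct_proficient, pct_distinguished):
    """Return an NAPD score from the percentage of students in each category.

    Hypothetical helper: the weights follow the rule described in the text.
    """
    total = pct_novice + pct_apprentice + pct_proficient + pct_distinguished
    if abs(total - 100.0) > 1e-6:
        raise ValueError("category percentages must sum to 100")
    # 1 point per percentage point proficient or distinguished,
    # 0.5 point per percentage point apprentice, 0 points for novice.
    return (pct_proficient + pct_distinguished) + 0.5 * pct_apprentice


# Illustrative school (made-up values): 20% novice, 30% apprentice,
# 35% proficient, 15% distinguished.
example = napd_score(20.0, 30.0, 35.0, 15.0)
print(example)  # 65.0
```

Under this rule a school scores 100 only when every student is proficient or distinguished, and 0 only when every student is novice.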

School rank. This variable is the percentile rank of a school 
based on overall school performance, ranging from 0 to 100. 
Higher percentile rank indicates better school performance. 
Schools are ranked against other schools of their level (e.g., other 
elementary schools). 

Attendance rate. Schools provided the percentage of enrolled 
students in attendance for every school day to the Kentucky 
Department of Education. The attendance rate is the average 
attendance percentage across the entire school year. 

Retention rate. The retention rate is the percentage of a 
school’s students who have been required to repeat a grade. 

Appalachian county (APPALACHIAN). This variable iden- 
tifies whether the school is located in a county that has been 
designated as Appalachian according to the Appalachian Regional 
Commission (http://www.arc.gov/about/index.asp). Fifty-four of 
the 120 counties in Kentucky are designated as Appalachian. 

Free and reduced-cost lunches (FREELUNCH). This is the 
percentage of students in the school receiving free or reduced-cost 
lunches. 

Teacher-student ratio (TSRATIO). The variable reflects 
the average number of students per teacher. 

Percentage African American (AFRICAN AMERICAN).
The percentage of students who are African American in a
given school is reflected in this variable. The average percentage
across all elementary schools was 9.14% (SD = 14.56%)
and ranged from 0.0% to 76.0%. However, 65% of schools were
5% or less African American. Only 2.9% of schools served a
population of students in which the majority was African American.

Table 1 (continued)

Variable                   Middle M (SD)       High M (SD)

Start time                 8:00 AM (20 min)    8:01 AM (18 min)
  Minimum                  7:20 AM             7:20 AM
  Maximum                  9:05 AM             9:05 AM
Schools starting at:
  7:00-7:19                0                   0
  7:20-7:59                151 (45.3%)         90 (39.0%)
  8:00-8:29                150 (45.1%)         121 (52.4%)
  8:30-8:59                22 (6.6%)           17 (7.4%)
  9:00-9:10                10 (3%)             3 (1.2%)
NAPD Language              30.77 (54.02)       66.06 (26.42)
NAPD Reading               58.94 (14.65)       55.46 (20.01)
NAPD Math                  58.67 (15.41)       48.40 (35.19)
NAPD Writing               63.52 (16.90)       63.41 (18.54)
NAPD Science               74.69 (33.26)       46.99 (34.30)
NAPD Social Studies        72.85 (33.48)       44.73 (32.51)
Attendance rate            94.06 (10.68)       93.27 (1.83)
Retention rate             0.231 (7.79)        3.33 (3.29)
Graduation rate                                71.03 (37.45)
College transition rate                        53.28 (25.56)
Student-teacher ratio      15.09 (9.18)        14.91 (10.90)

Percentage Hispanic (HISPANIC). The percentage of stu- 
dents who are Hispanic in a given school is reflected in this 
variable. The average percentage across all elementary schools 
was 4.70% (SD = 6.68%). However, 71.3% of schools were 5% or 
less Hispanic. Only two schools (< 1%) served a population of 
students in which the majority was Hispanic. 


Data Analyses 


Because schools were nested within county (in Kentucky, there 
is one school district for each county), schools within the same 
county were not independent of each other and multilevel model- 
ing was required for data analysis (see Raudenbush & Bryk, 2002 
for a detailed overview of this statistical procedure). Multilevel 
modeling for nested data and similar procedures are common in 
educational research (e.g., Dettmers, Trautwein, Ludtke, Kunter, & 
Baumert, 2010; Goddard & Goddard, 2001; Shen, Leslie, Spy- 
brook, & Ma, 2012; Wenglinsky, 2002), including research on 
school start times (Edwards, 2012). In multilevel modeling, 
within-county variability is partitioned from between-county variability.
At Level 1, the within-county level, dependent variables
(e.g., NAPD scores) for schools (i) in counties (j) are modeled as
a function of an intercept ($\beta_{0j}$; the expected value of the dependent
variable when there are scores of zero on the independent variables
included in the Level 1 model) and the effects of independent
variables that vary from school to school within the same county
(e.g., school start times; $\beta_{1j}$):

$$
\begin{aligned}
\mathrm{NAPDMATH}_{ij} = {} & \beta_{0j} + \beta_{1j}(\mathrm{STARTTIME}_{ij}) \\
& + \beta_{2j}(\mathrm{FREELUNCH}_{ij}) + \beta_{3j}(\mathrm{TIMEXLUNCH}_{ij}) \\
& + \beta_{4j}(\mathrm{AFRICAN\ AMERICAN}_{ij}) \\
& + \beta_{5j}(\mathrm{HISPANIC}_{ij}) + \beta_{6j}(\mathrm{TSRATIO}_{ij}).
\end{aligned}
$$


Table 2
Per Grade Administration of Standardized Tests and Total Score Ranges Per Student Classification

Subject                 Novice    Apprentice    Proficient    Distinguished

Grade 3
  Reading               0-8       9-16          17-23         24-30
  Mathematics           0-9       10-16         17-24         25-30
Grade 4
  Reading               0-8       9-16          17-23         24-30
  Mathematics           0-8       9-16          17-23         24-30
  Science               0-9       10-17         18-24         25-30
  Language Mechanics    0-9       10-17         18-25         26-30
Grade 5
  Reading               0-9       10-17         18-24         25-30
  Mathematics           0-8       9-16          17-25         26-30
  Social Studies        0-10      11-18         19-25         26-30
  Writing               0-9       10-17         18-24         25-30
Grade 6
  Reading               0-9       10-17         18-24         25-30
  Mathematics           0-8       9-15          16-25         26-30
  Writing               0-9       10-17         18-24         25-30
  Language Mechanics    0-9       10-17         18-24         25-30
Grade 7
  Reading               0-9       10-14         15-20         21-30
  Mathematics           0-8       9-14          15-22         23-30
  Science               0-9       10-16         17-21         22-30
Grade 8
  Reading               0-8       9-15          16-21         22-30
  Mathematics           0-9       10-15         16-22         23-30
  Social Studies        0-9       10-17         18-25         26-30
  Writing               0-10      11-18         19-25         26-30
Grade 9
  Reading               0-9       10-17         18-24         25-30
Grade 10
  Mathematics           0-9       10-15         16-22         23-30
  Writing               0-9       10-17         18-24         25-30
  Language Mechanics    0-9       10-17         18-24         25-30
Grade 11
  Writing               0-8       9-16          17-24         25-30
  Science               0-9       10-15         16-22         23-30
Grade 12
  Social Studies        0-9       10-18         19-25         26-30





The above equation illustrates that we examined associations between
start times and school performance, controlling for teacher-student
ratio, percentage of students identified as African American,
and percentage of students identified as Hispanic. Coefficients
for the independent variables are interpreted in essentially the
same way as regression coefficients. Interactions between Level 1
variables can be entered ($\beta_{3j}$) and indicate whether Level 1
coefficients vary based on the values of other Level 1 variables.

In essence, each county has its own regression equation. At Level
2, the between-county level, each of the coefficients at Level 1 is
modeled as a linear function of an intercept (e.g., $\pi_{10}$; the expected
value of the Level 1 coefficient for schools with values of zero on the
other variables entered into the Level 2 equation) and the effects of
independent variables that only vary from county to county and not
within county (e.g., $\pi_{20}$, designation of Appalachian county):

$$
\begin{aligned}
\beta_{0j} &= \pi_{10} + \pi_{20}(\mathrm{APPALACHIAN}_{j}) \\
\beta_{1j} &= \pi_{11} + \pi_{21}(\mathrm{APPALACHIAN}_{j}) \\
\beta_{2j} &= \pi_{12} \\
\beta_{3j} &= \pi_{13} \\
\beta_{4j} &= \pi_{14} \\
\beta_{5j} &= \pi_{15} \\
\beta_{6j} &= \pi_{16}.
\end{aligned}
$$


Coefficients for the Level 2 predictors in the top equation can be
interpreted as the first-order effects of the Level 2 variables on the
dependent variable. That is, the coefficient $\pi_{20}$ in the top equation
above represents the effect of Appalachian county designation on
NAPD math scores. Coefficients for the Level 2 predictors of the
other Level 1 coefficients can be interpreted as moderation effects:
They provide information concerning whether the Level 1 coefficient
varies based on between-county variables. That is, the coefficient $\pi_{21}$
indicates whether the effect of school start times on NAPD math
scores depends on whether the school is located in an Appalachian
county. Level 2 independent variables could be added to any of the
Level 2 models, but such effects were not of interest in the current
study. The coefficients $\pi_{12}$ through $\pi_{16}$ therefore indicate the average
effects across all counties of the percentage of students receiving free
or reduced-cost lunches, the interaction between this variable and school
start times, AFRICAN AMERICAN, HISPANIC, and TSRATIO, respectively.
Estimates of coefficients and their standard errors are only
provided at Level 2. Only unstandardized coefficients are presented. 
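
Substituting the Level 2 equations into the Level 1 equation gives a single combined model, which may make the two Appalachian terms easier to see. The sketch below assumes the subscript convention in which $\pi_{1k}$ is the Level 2 intercept for the $k$th Level 1 coefficient and $\pi_{2k}$ is the corresponding APPALACHIAN coefficient. Residual and random-effect terms are omitted because they are not written out in the equations above.

```latex
\begin{aligned}
\mathrm{NAPDMATH}_{ij} = {} & \pi_{10} + \pi_{20}(\mathrm{APPALACHIAN}_{j})
  + \pi_{11}(\mathrm{STARTTIME}_{ij}) \\
& + \pi_{21}(\mathrm{APPALACHIAN}_{j} \times \mathrm{STARTTIME}_{ij}) \\
& + \pi_{12}(\mathrm{FREELUNCH}_{ij}) + \pi_{13}(\mathrm{TIMEXLUNCH}_{ij}) \\
& + \pi_{14}(\mathrm{AFRICAN\ AMERICAN}_{ij})
  + \pi_{15}(\mathrm{HISPANIC}_{ij}) + \pi_{16}(\mathrm{TSRATIO}_{ij})
\end{aligned}
```

In this form, $\pi_{20}$ is the first-order Appalachian effect, and $\pi_{21}$ multiplies the product of a county-level and a school-level variable, which is the cross-level moderation effect described above.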

Separate models were fit predicting each NAPD subject score,
school rank, attendance rate, and retention rate. School rank is an
ordinal variable. However, alternative modeling techniques for estimating
nested ordinal variables are beneficial primarily when there are
seven or fewer categories (Bauer & Sterba, 2011). School rank had 99
different categories. We therefore used traditional multilevel modeling
for these data. All continuous independent variables were mean centered
before computing cross products. Designation of county as
Appalachian (APPALACHIAN) was a dummy variable coded as 0
for non-Appalachian and 1 for Appalachian. Separate models were
also fit for interactions between school start times and either FREELUNCH
or APPALACHIAN. Effects were considered significant if
p < .05. Significant interactions were plotted at ±1 SD from the mean
for school start times and FREELUNCH or for Appalachian/non-Appalachian
counties. Significant interactions were probed using online
utilities available at http://www.quantpsy.org (Preacher, Curran,
& Bauer, 2006). 
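
The variable preparation described above (start times as minutes since midnight, mean centering, dummy coding, and cross products) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `to_minutes` and `mean_center` and all data values are hypothetical and are not study data.

```python
from statistics import mean


def to_minutes(clock):
    """Convert an 'H:MM' 24-hour clock string to minutes since midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def mean_center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical schools (illustrative values only).
start_clock = ["7:30", "8:00", "8:30", "9:00"]
free_lunch = [80.0, 60.0, 40.0, 20.0]        # percent of students
appalachian = ["yes", "no", "no", "yes"]     # county designation

start_min = [to_minutes(c) for c in start_clock]

# Center the continuous predictors first, then form the cross product.
start_c = mean_center(start_min)
lunch_c = mean_center(free_lunch)
time_x_lunch = [s * l for s, l in zip(start_c, lunch_c)]

# Dummy code: 0 = non-Appalachian, 1 = Appalachian.
app_dummy = [1 if a == "yes" else 0 for a in appalachian]

print(start_min, start_c, lunch_c, time_x_lunch, app_dummy)
```

Centering before multiplying means each first-order coefficient can be read as the effect of that variable at the mean of the other variable in the product.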


Results 


Interactions Between School Start Times 
and FREELUNCH 


Several significant interactions between elementary school 
start times and FREELUNCH were observed (see Table 3). The

Table 3
Model Results for Interactions Between Elementary School Start Times and Fraction of Students Receiving Free or Reduced-Cost Lunches

NAPD: Language, Reading, Math, Science, Social Studies, Writing
Variable Language Reading Math Science Social Studies Writing School rank Attendance rate Retention rate
Intercept 
Intercept (774) 6845" 62875" 62.481" —90:430"" 80.10°™* SiGe 52103 7n- OSs Sines 036554 
APPALACHIAN (11,9) —9.126"** —6.863""* -—6.354"™" —8.814"" —6.288™ —4,963"* 16105) i lOO 0.313™* 
TSRATIO 
Intercept 52075 eNO Sim 673 Eso ae 1K2267-* .798"* eee 1080*** | 20408 
AFRICAN AMERICAN 
Intercept aaa — A127 ** 2 fees Sh aa —,413°" 324 =1.031""* — 005m —.001 
HISPANIC : 
Intercept —.4387™™ —A05i = aera Dame =A Sy Ol: Oo Ong —,.009** 
School Start Time . 
Intercept (7, ) .059* .038 .044* 017 .058** [055° isi .002 .002* 
FREE LUNCH 
Intercept (71>) = 0372 OS tis Ola 001 —.248 ars —.602 —.009 —.015 
Start Time X LUNCH
Intercept ($\pi_{13}$) -.017* -.015*** -.012* -.010* -.010** -.013** -.029*** -.001* .000


Note. Columns indicate the dependent variable being predicted. Statistical notation provided in parentheses corresponds to the equations provided in the analysis section.
*p < .05. **p < .01. ***p < .001.


interaction predicted NAPD Language scores, $\pi_{13}$ = -.017, p <
.05; NAPD Reading scores, $\pi_{13}$ = -.015, p < .001; NAPD
Science scores, $\pi_{13}$ = -.010, p < .05; NAPD Math scores,
$\pi_{13}$ = -.012, p < .05; NAPD Social Studies scores, $\pi_{13}$ =
-.010, p < .01; NAPD Writing scores, $\pi_{13}$ = -.013, p < .01;
school rank, $\pi_{13}$ = -.029, p < .001; and school attendance rate,
$\pi_{13}$ = -.001, p < .05.

Interactions were plotted and were all nearly identical (see 
Figures 1 and 2 for examples). Results of probing the interactions 
are also shown in Table 4. The first two rows show the simple 
slopes for the effect of school start time on the dependent variable 
(see column heading) for lower and higher values of FREELUNCH. 
The bottom two rows illustrate the expected difference in the 
dependent variable for schools starting 1 hr later than another 
school. In all cases, there was a significant association between
school start times and school performance only for schools with a
lower percentage of students receiving free or reduced-cost
lunches (i.e., schools with more middle and upper class students).
The difference in NAPD scores associated with a 1-hr difference in
school start time ranged from 3 to almost 7 points. A 1-hr difference
in school start time was associated with a school rank that was
14 percentile points higher and an attendance rate that was .32 units
higher. 
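
The probing logic can be sketched in a few lines: with mean-centered predictors, the simple slope of start time at a given FREELUNCH value is the start-time coefficient plus the interaction coefficient times the centered FREELUNCH value, and the expected difference for schools starting 1 hr apart is 60 times that per-minute slope. The two coefficients below are the NAPD Language values as read from Table 3. The standard deviation `SD_LUNCH` is a hypothetical value chosen only for illustration, because the FREELUNCH standard deviation is not reported in this section, and the function names are likewise assumptions.

```python
def simple_slope(b_time, b_interaction, moderator_centered):
    """Per-minute slope of start time at a mean-centered moderator value."""
    return b_time + b_interaction * moderator_centered


def hour_difference(slope_per_minute):
    """Expected outcome difference for schools starting 60 minutes apart."""
    return 60 * slope_per_minute


B_TIME = 0.059          # start-time coefficient, NAPD Language (Table 3)
B_INTERACTION = -0.017  # start time x FREELUNCH coefficient (Table 3)
SD_LUNCH = 3.3          # HYPOTHETICAL standard deviation of FREELUNCH

# Probe the interaction at 1 SD below and 1 SD above the FREELUNCH mean.
low = simple_slope(B_TIME, B_INTERACTION, -SD_LUNCH)
high = simple_slope(B_TIME, B_INTERACTION, +SD_LUNCH)

print(round(low, 3), round(hour_difference(low), 2))
print(round(high, 3), round(hour_difference(high), 2))
```

Because the interaction coefficient is negative, the start-time slope shrinks toward zero as the percentage of students receiving free or reduced-cost lunches rises, which matches the pattern reported in the text.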


Interactions Between School Start Times and 
APPALACHIAN 


No significant interactions were observed. 


Main Effects of School Start Times 


Only one main effect of school start times that was not qualified
by an interaction was observed. Later school start times were
associated with higher retention rates, $\pi_{11}$ = .002, p < .01. Every
additional minute later in the school start time was associated with
retention rates that were 0.2% higher. A 1-hr difference in school start
time would therefore be related to a 12% difference in retention rate.


Discussion


Prior research has indicated an association between early school
start times and less total sleep time, more daytime fatigue and
sleepiness, more school tardiness, and lower school academic
performance (Epstein et al., 1998; Owens et al., 2010; Wahlstrom,
2002; Wolfson et al., 2007). However, no study to our knowledge
has examined associations among school start time, attendance
rates, and academic performance on a statewide level. The
present study investigated relations between school start times and 
a number of school performance standards in public elementary 
schools in Kentucky. We had two main hypotheses: (a) Earlier 
school start times will be associated with lower standardized test 
scores, poorer attendance, higher retention rates, lower school 
rank, and school underperformance; and (b) earlier start times will 
be especially risky for school performance standards in more 
disadvantaged schools, including Appalachian schools and schools 
with a higher percentage of students receiving free or reduced-cost 
lunches. Unexpectedly, findings indicated that earlier school start
times were related to lower school performance predominantly for
elementary schools with fewer students receiving free or reduced-cost
lunches. No differences in associations between Appalachian
and non-Appalachian counties were observed.

For those schools for which an association was found, earlier
start times were related to poorer test scores, lower school rank,
and more student absences. These findings are consistent with
previous research (Epstein et al., 1998; Wahlstrom, 2002; Wolfson
et al., 2007). The relationship between earlier start times and
poorer academic performance may be explained by the physical,
behavioral, and psychological ramifications of sleep deprivation.
Earlier start times may lead to student sleep deprivation by placing
constraints on the amount of sleep a child or adolescent is able to
obtain (Dexter et al., 2003; Epstein et al., 1998; Wolfson &
Carskadon, 1998; Wolfson et al., 2007). Students may therefore
lose the ability to remain alert and focused in the classroom
(Durmer & Dinges, 2005; Epstein et al., 1998). Sleep deprivation
increases hyperactivity and behavioral dysregulation, impairing
students' academic functioning (Beebe, 2011; Dworak et al., 2007;
Wolfson & Carskadon, 1998). Sleep problems are also associated
with asthma (Kakkar & Berry, 2009), compromised cardiovascular
health (Cappuccio, Cooper, D'Elia, Strazzullo, & Miller, 2011),
gastrointestinal problems (Chen, Liu, Yi, & Orr, 2011), and reduced
effectiveness of the immune system (Bryant, Trinder, &
Curtis, 2004; Irwin et al., 1996). Therefore, sleep deprivation
resulting from early school start times may increase the frequency,
severity, and duration of illness, resulting in increased rates of
absenteeism.

Findings clearly show that, at least for middle and upper class
students, earlier school start times can be associated with poorer
school performance in elementary schools. The implication is that
research on school start times should not focus exclusively on
adolescents. Sufficient sleep is of critical importance across development
(Fallone, Owens, & Deane, 2002). According to the
National Sleep Foundation 2004 Sleep in America Poll, more than
25% of school-age children (first grade to fifth grade) obtain less
than the recommended daily amount of sleep. Modern-day elementary
school children may be taking on additional responsibilities,
extracurricular activities, and/or entertainment opportunities
that delay regular weeknight bedtimes. The use of media by
children (e.g., television, video games) has been identified as
especially problematic for delaying bedtimes, increasing sleep
onset latency, and decreasing the amount of total sleep time
obtained (National Sleep Foundation, 2011; Owens et al., 1999).
As a result, early school start times may affect student performance
even before the puberty-related delay in sleep phase.

Table 4
Results of Probing Interactions Between School Start Times and Percentage of Students Receiving Free or Reduced-Cost Lunches

NAPD: Language, Reading, Math, Science, Social Studies, Writing

Effects and differences of start times       Language  Reading  Math    Science  Social Studies  Writing  School rank  Attendance rate

Estimated effect of school start times
  Schools with lower FREELUNCH               …         …        .050*   …        …               …        …            .002*
  Schools with higher FREELUNCH              .003      -.012    -.016   .004     .025            .012     .041         -.001
Difference in schools starting 1 hr apart
  Schools with lower FREELUNCH               6.90      6.23     3.01    5.03     5.48            5.90     14.01        0.32
  Schools with higher FREELUNCH              0.18      -0.72    -0.96   0.24     1.50            0.72     2.46         -0.06

Note. The first two rows show the simple slopes for the effect of school start time on the dependent variable (see column heading) for lower and higher
values of the moderator (FREELUNCH). The bottom two rows illustrate the expected difference in the dependent variable for schools starting 1 hr later
than another school.

Of particular concern is that the growing public support for 
delaying middle and high school start times is often at the expense 
of making elementary school start times earlier. Indeed, this has 
already occurred in two counties in Kentucky (Fayette and Jessa- 
mine; National Sleep Foundation, 2005a, 2005b). This is often 
done in order to preserve staggered bus scheduling (Kirby et al., 
2011). Our findings suggest that these policy changes may simply 
be shifting the problem from adolescents to younger children, 
instead of eliminating it altogether. On the one hand, elementary 
school children are not experiencing the puberty-related phase 
shift in sleep-wake regulation. Therefore, earlier bedtimes and 
improved sleep hygiene may more readily prevent sleep depriva- 
tion in this student group. Nevertheless, if parents do not alter their 
children’s sleep behavior in response to earlier start times, elemen- 
tary school performance may suffer, and these reductions in early 
student learning may have implications for academic achievement 
over the long term (G. W. Ladd & Dinella, 2009). On the other 
hand, making school start times later for all grade levels may be a 
feasible solution for some school districts (Kirby et al., 2011). 

The association between later start times and higher retention 
rates was unexpected and indicates that later school start times 
were associated with a greater number of children being held back 
a grade. To our knowledge, this is the first study to examine 
student retention in relation to school start times, and it is therefore 
difficult to draw firm conclusions about this finding. However, 
given that other indices of school performance were improved at
later school start times, one possible explanation is that once the
average students begin to improve, students with learning difficulties
have an especially hard time keeping up. Lagging further
behind the majority of students may lead to retention. This expla- 
nation is somewhat consistent with the findings that later school 
start times tend to benefit only those schools that have more 
middle or upper class students. On the other hand, this finding is 
inconsistent with other research suggesting that students with the 
lowest scores benefit from later school start times the most (Ed- 
wards, 2012). 

Appalachian county designation did not moderate any associa- 
tions, although it was consistently related to poorer school perfor- 
mance. On the other hand, the percentage of students qualifying 
for free and reduced-cost lunch (based on family income and 
therefore a measure of low socioeconomic status) consistently 
moderated associations between school start times and school 
academic success. Significant relations between early school start 
times and poor school performance were found only for schools 
with a lower percentage of students qualifying for free and 
reduced-cost lunches (e.g., for schools with a wealthier student 
population). In other words, schools with economically disadvan- 
taged students were unlikely to show better school performance if 
their start times were later. This is inconsistent with recent policy 
proposals suggesting that later school start times are a promising 
mechanism for closing the achievement gap between poor and 
wealthy students (Jacob & Rockoff, 2011). 

This lack of improvement in poorer school systems may be 
explained through a cumulative risk model (Evans, 2004; Samer- 
off, Seifer, Barocas, Zax, & Greenspan, 1987). According to 
Dubow and Ippolito (1994), poverty may be one of the single
greatest risk factors for student academic performance. According
to the cumulative risk model, poverty influences child develop- 
ment because of the accumulation of multiple stressors that ac- 
company poverty (Sameroff et al., 1987). Indeed, poverty has been 
linked to a wide range of stressors in both the psychosocial and 
physical environments (Evans, 2004). For example, the psychos- 
ocial environment of poverty may be characterized by exposure to 
violence (Emery & Laumann-Billings, 1998), marital conflict or 
divorce (Liu & Chen, 2006), harsh and unresponsive parenting 
(Conger & Elder, 1994; Grant et al., 2003), low parental monitor- 
ing (Kilgore, Snyder, & Lentz, 2000), less cognitive stimulation 
(Hoff, Laursen, & Tardiff, 2002), less parental involvement in 
school systems (Benveniste, Carnoy, & Rothstein, 2003), schools 
with less highly trained teachers and greater violence (Clotfelter, 
Ladd, Vigdor, & Wheeler, 2006; Milam, Furr-Holden, & Leaf, 
2010), and changes in schools and residences (Herbers et al., 
2012). The physical environment of poverty may be characterized 
by exposure to toxins and parental smoking (Centers for Disease 
Control and Prevention, 2010; Legot, London, Rosofsky, & Shan- 
dra, 2012), noise (Evans & Kim, 2012), crowded housing condi- 
tions (Myers, Baer, & Choi, 1996), inadequate heat (Children’s 
Defense Fund, 1995), lack of air conditioning (Federman et al., 
1996), poor nutrition (Alaimo, Olson, Frongillo, & Briefel, 2001), 
and crumbling schools (National Center for Education Statistics, 
2000). 

The cumulative model of risk posits that no one specific risk 
factor is tied to child developmental outcomes. Rather, it is the
number of risk factors that predicts developmental outcomes, including
allostatic load, academic achievement, and mental health
(Appleyard, Egeland, van Dulmen, & Sroufe, 2005). Several stud- 
ies now indicate that the presence of four or more risk factors 
conveys special risk for compromised development (Sameroff, 
Bartko, Baldwin, Baldwin, & Seifer, 1998). Children growing up 
in poverty are likely to experience this number of risks. Low 
income fourth graders have 35% more negative life events in a 
year than middle income fourth graders (Attar, Guerra, & Tolan, 
1994). Other studies report even larger discrepancies based on 
income; approximately 35% of children living in poverty—com- 
pared to only 5% in wealthier families—have six or more risk 
factors present in their lives (Liaw & Brooks-Gunn, 1994). The 
increased risk burden mediates the association between poverty 
and psychophysiological functioning and psychological stress (Ev- 
ans & English, 2002). 

The implication is that removing one risk factor may have little 
impact, unless it brings the child under the risk threshold. At the 
same time, there is an incremental influence over time: The longer 
one is exposed to the stresses and disadvantages associated with 
poverty, the greater the risk and the poorer the outcomes in 
psychological and cognitive domains (Lynch, Kaplan, & Shema, 
1997). The impact of later school start times for impoverished 
school children may therefore be too little, too late, for academic 
performance. Indeed, later school start times may not even im- 
prove sleep in poor children. There is an increased incidence of 
sleep problems in the context of poverty, perhaps because of less 
comfortable sleep surfaces and room temperatures, room sharing, 
noise, and poor sleep hygiene (Buckhalt & Staton, 2011). As such, 
a delay in school start times may not be sufficient to overcome the 
numerous other obstacles that children in poverty face, including 
obstacles to obtaining adequate sleep.


Limitations


The current study did not assess sleep directly and did not 
differentiate different aspects of sleep. A meta-analysis about sleep 
and school performance has shown that different measures of sleep 
condition are related to school performance to differing extents: 
Sleepiness is most strongly related to school performance, fol- 
lowed by sleep quality and sleep duration (Dewald et al., 2010). 
Earlier school start time may jeopardize different facets of sleep, 
and further research is needed to differentiate these. The current 
study is also limited by its cross-sectional design and data from 
only one state. Although we controlled for a number of potential
confounding factors, including the racial composition of the
schools and teacher-student ratio, we cannot infer that early school
start times caused differences in school performance. Findings
may not generalize to other states, especially to states that
have different levels of poverty or more racial diversity than Kentucky.
Finally, we used traditional estimation methods to predict
school rank; this variable is a rank order variable, and the traditional
estimation procedure may yield somewhat inaccurate estimates.

Despite these limitations, this study addresses some key gaps in
the current literature on school start times. First, we demonstrate
that there are associations between early school start times and
school performance, particularly among elementary schools serving
middle and upper class students. Second, we identify school
characteristics that moderate associations between school start times
and school performance, which has rarely been done for this topic.
Finally, we provide one of the very few examinations of school start
times and test scores in elementary schools. Our findings indicate that early
school start times may be just as detrimental for young children as
they are for adolescents.


References 


Alaimo, K., Olson, C. M., Frongillo, E. A., & Briefel, R. R. (2001). Food 
insufficiency, family income, and health in U.S. preschool and school- 
aged children. American Journal of Public Health, 91, 781-786. doi: 
10.2105/AJPH.91.5.781 

Anderson, B., Storfer-Isser, A., Taylor, H. G., Rosen, C. L., & Redline, S. 
(2009). Associations of executive function with sleepiness and sleep 
duration in adolescents. Pediatrics, 123, 701-707. doi:10.1542/peds 
.2008-1182 

Appleyard, K., Egeland, B., van Dulmen, M. H. M., & Sroufe, L. A. 
(2005). When more is not better: The role of cumulative risk in child 
behavior outcomes. Journal of Child Psychology and Psychiatry, 46, 
235-245. doi:10.1111/j.1469-7610.2004.00351.x 

Astill, R. G., Van der Heijden, K. B., Van IJzendoorn, M. H., & Van
Someren, E. J. (2012). Sleep, cognition, and behavioral problems in
school-age children: A century of research meta-analyzed. Psychological
Bulletin, 138, 1109-1138. doi:10.1037/a0028204

Attar, B., Guerra, N., & Tolan, P. (1994). Neighborhood disadvantage,
stressful life events, and adjustment in urban elementary school children.
Journal of Clinical Child Psychology, 23, 391-400. doi:10.1207/
s15374424jccp2304_5

Bauer, D. J., & Sterba, S. K. (2011). Fitting multilevel models with ordinal 
outcomes: Performance of alternative specifications and methods of 
estimation. Psychological Methods, 16, 373-390. doi:10.1037/a0025813 

Beebe, D. W. (2011). Cognitive, behavioral, and functional consequences 
of inadequate sleep in children and adolescents. Pediatric Clinics of 
North America, 58, 649-665. doi:10.1016/j.pcl.2011.03.002 


Benveniste, L., Carnoy, M., & Rothstein, R. (2003). All else equal. New
York, NY: RoutledgeFalmer.

Bryant, P. A., Trinder, J., & Curtis, N. (2004). Sick and tired: Does sleep
have a vital role in the immune system? Nature Reviews Immunology, 4,
457-467. doi:10.1038/nri1369

Buckhalt, J. A. (2011). Insufficient sleep and the socioeconomic status 
achievement gap. Child Development Perspectives, 5, 59-65. doi: 
10.1111/j.1750-8606.2010.00151.x 

Buckhalt, J. A., El-Sheikh, M., Keller, P. S., & Kelly, R. J. (2009). 
Concurrent and longitudinal relations between children’s sleep and 
cognitive functioning: The moderating role of parent education. Child 
Development, 80, 875-892. doi:10.1111/j.1467-8624.2009.01303.x 

Buckhalt, J. A., & Staton, L. E. (2011). Children’s sleep, cognition, and 
academic performance in the context of socioeconomic status and eth- 
nicity. In M. El-Sheikh (Ed.), Sleep and development: Familial and 
socio-cultural considerations (pp. 245-264). New York, NY: Oxford 
University Press. doi:10.1093/acprof:oso/9780195395754.003.0011 

Cappuccio, F. P., Cooper, D., D’Elia, L., Strazzullo, P., & Miller, M. A. 
(2011). Sleep duration predicts cardiovascular outcomes: A systematic 
review and meta-analysis of prospective studies. European Heart Jour- 
nal, 32, 1484-1492. doi:10.1093/eurheartj/ehr007 

Carskadon, M. A., Wolfson, A. R., Acebo, C., Tzischinsky, O., & Seifer, 
R. (1998). Adolescent sleep patterns, circadian timing, and sleepiness at 
a transition to early school days. Sleep: Journal of Sleep Research & 
Sleep Medicine, 21, 871-881. 

Centers for Disease Control and Prevention. (2010). Vital signs: Current 
cigarette smoking among adults aged ≥18 years—United States, 2009. 
Morbidity and Mortality Weekly Report, 59, 1135-1140. 

Chen, C. L., Liu, T. T., Yi, C. H., & Orr, W. C. (2011). Evidence for altered 
anorectal function in irritable bowel syndrome patients with sleep dis- 
turbance. Digestion, 84, 247-251. doi:10.1159/000330847 

Chervin, R. D., Ruzicka, D. L., Giordani, B. J., Weatherly, R. A., Dillon, 
J. E., Hodges, E. K., . . . Guire, K. E. (2006). Sleep-disordered breathing, 
behavior, and cognition in children before and after adenotonsillectomy. 
Pediatrics, 117, 769-778. doi:10.1542/peds.2005-1837 

Children’s Defense Fund. (1995). The state of America’s children year- 
book 1995. Washington, DC: Author. 

Clotfelter, C., Ladd, H. F., Vigdor, J., & Wheeler, J. (2006). High-poverty 
schools and the distribution of teachers and principals. North Carolina 
Law Review, 85, 1345-1379. 

Conger, R. D., & Elder, G. H. (1994). Families in troubled times. New 
York, NY: Aldine de Gruyter. 

Crowley, S. J., Acebo, C., & Carskadon, M. A. (2007). Sleep, circadian 
rhythms, and delayed phase in adolescence. Sleep Medicine, 8, 602-612. 
doi:10.1016/j.sleep.2006.12.002 

Curcio, G., Ferrara, M., & De Gennaro, L. (2006). Sleep loss, learning 
capacity and academic performance. Sleep Medicine Reviews, 10, 323- 
337. doi:10.1016/j.smrv.2005.11.001 

Dettmers, S., Trautwein, U., Lüdtke, O., Kunter, M., & Baumert, J. (2010). 
Homework works if homework quality is high: Using multilevel mod- 
eling to predict the development of achievement in mathematics. Journal 
of Educational Psychology, 102, 467-482. doi:10.1037/a0018453 

Dewald, J. F., Meijer, A. M., Oort, F. J., Kerkhof, G. A., & Bögels, S. M. 
(2010). The influence of sleep quality, sleep duration and sleepiness on 
school performance in children and adolescents: A meta-analytic review. 
Sleep Medicine Reviews, 14, 179-189. doi:10.1016/j.smrv.2009.10.004 

Dexter, D., Bijwadia, J., Schilling, D., & Applebaugh, G. (2003). Sleep, 
sleepiness and school start times: A preliminary study. Wisconsin Med- 
ical Journal, 102(1), 44-46. 

de Young, A. J. (1985). Economic-development and educational status in 
Appalachian Kentucky. Comparative Education Review, 29(1), 47-67. 
doi:10.1086/446488 


Dubow, E. F., & Ippolito, M. F. (1994). Effects of poverty and quality of 
the home environment on changes in the academic and behavioral 
adjustment of elementary school-age children. Journal of Clinical Child 
Psychology, 23, 401-412. doi:10.1207/s15374424jccp2304_6 

Durmer, J. S., & Dinges, D. F. (2005). Neurocognitive consequences of 
sleep deprivation. Seminars in Neurology, 25, 117-129. doi:10.1055/s- 
2005-867080 

Dworak, M., Schierl, T., Bruns, T., & Strüder, H. K. (2007). Impact of 
singular excessive computer game and television exposure on sleep 
patterns and memory performance of school-aged children. Pediatrics, 
120, 978-985. doi:10.1542/peds.2007-0476 

Edwards, F. (2012). Early to rise? The effect of daily start times on 
academic performance. Economics of Education Review, 31, 970-983. 
doi:10.1016/j.econedurev.2012.07.006 

El-Sheikh, M., Buckhalt, J. A., Keller, P. S., Cummings, E. M., & Acebo, 
C. (2007). Child emotional insecurity and academic achievement: The 
role of sleep disruptions. Journal of Family Psychology, 21, 29-38. 
doi:10.1037/0893-3200.21.1.29 

Emery, R. E., & Laumann-Billings, L. (1998). An overview of the nature, 
causes and consequences of abusive family relationships. American 
Psychologist, 53, 121-135. doi:10.1037/0003-066X.53.2.121 

Epstein, R., Chillag, N., & Lavie, P. (1998). Starting times of school: 
Effects on daytime functioning of fifth-grade children in Israel. Sleep: 
Journal of Sleep Research & Sleep Medicine, 21, 250-256. 

Evans, G. W. (2004). The environment of childhood poverty. American 
Psychologist, 59, 77-92. doi:10.1037/0003-066X.59.2.77 

Evans, G. W., & English, K. (2002). The environment of poverty: Multiple 
stressor exposure, psychophysiological stress, and socioemotional ad- 
justment. Child Development, 73, 1238-1248. doi:10.1111/1467-8624 
.00469 

Evans, G. W., & Kim, P. (2012). Childhood poverty and young adults’ 
allostatic load: The mediating role of childhood cumulative risk 
exposure. Psychological Science, 23, 979-983. doi:10.1177/ 
0956797612441218 

Fallone, G., Owens, J. A., & Deane, J. (2002). Sleepiness in children and 
adolescents: Clinical implications. Sleep Medicine Reviews, 6, 287-306. 
doi:10.1053/smrv.2001.0192 

Federman, M., Garner, T., Short, K., Cutter, W., Levine, D., McGough, D., 
& McMillen, M. (1996, May). What does it mean to be poor in America? 
Monthly Labor Review, (5), 3-17. 

Goddard, R. D., & Goddard, Y. L. (2001). A multilevel analysis of the 
relationship between teacher and collective efficacy in urban schools. 
Teaching and Teacher Education, 17, 807-818. doi:10.1016/S0742- 
051X(01)00032-4 

Grant, K. E., Compas, B. E., Stuhlmacher, A., Thurm, A., McMahon, S., 
& Halpert, J. (2003). Stressors and child and adolescent psychopathol- 
ogy: Moving from markers to mechanisms of risk. Psychological Bul- 
letin, 129, 447-466. doi:10.1037/0033-2909.129.3.447 

Herbers, J. E., Cutuli, J. J., Supkoff, L. M., Heistad, D., Chan, C. K., Hinz, 
E., & Masten, A. S. (2012). Early reading skills and academic achieve- 
ment trajectories of students facing poverty, homelessness, and high 
residential mobility. Educational Researcher, 41, 366-374. doi: 
10.3102/0013189X12445320 

Hoff, E., Laursen, B., & Tardiff, T. (2002). Socioeconomic status and 
parenting. In M. H. Bornstein (Ed.), Handbook of parenting (2nd ed., pp. 
231-252). Mahwah, NJ: Erlbaum. 

Irwin, M., McClintick, J., Costlow, C., Fortner, M., White, J., & Gillin, 
J. C. (1996). Partial night sleep deprivation reduces natural killer and 
cellular immune responses in humans. FASEB Journal, 10, 643-653. 

Jacob, B. A., & Rockoff, J. E. (2011). Organizing schools to improve 
student achievement: Start times, grade configurations, and teacher 
assignments. Washington, DC: Hamilton Project. 


Kakkar, R. K., & Berry, R. B. (2009). Asthma and obstructive sleep apnea: 
At different ends of the same airway? Chest, 135, 1115-1116. doi: 
10.1378/chest.08-2778 

Kentucky Department of Education. (2011). Kentucky school report cards 
(2011-2012) [Data set]. Retrieved from http://applications.education.ky 
.gov/SRC/DataSets.aspx 

Kentucky Department of Education. (2012). Researchers. Retrieved from 
http://education.ky.gov/Pages/default.aspx 

Kilgore, K., Snyder, J., & Lentz, C. (2000). The contribution of parental 
discipline, parental monitoring, and school risk to early-onset conduct 
problems in African American boys and girls. Developmental Psychol- 
ogy, 36, 835-845. doi:10.1037/0012-1649.36.6.835 

Kirby, M., Maggi, S., & D’Angiulli, A. (2011). School start times and the 
sleep-wake cycle of adolescents: A review and critical evaluation of 
available evidence. Educational Researcher, 40, 56-61. doi:10.3102/ 
0013189X11402323 

Knutson, K. L., & Lauderdale, D. S. (2009). Sociodemographic and be- 
havioral predictors of bed time and wake time among U.S. adolescents 
aged 15 to 17 years. The Journal of Pediatrics, 154, 426-430. doi: 
10.1016/j.jpeds.2008.08.035 

Kopasz, M., Loessl, B., Hornyak, M., Riemann, D., Nissen, C., Piosczyk, 
H., & Voderholzer, U. (2010). Sleep and memory in healthy children and 
adolescents - a critical review. Sleep Medicine Reviews, 14, 167-177. 
doi:10.1016/j.smrv.2009. 10.006 

Ladd, G. W., & Dinella, L. M. (2009). Continuity and change in early 
school engagement: Predictive of children’s achievement trajectories 
from first to eighth grade? Journal of Educational Psychology, 101, 
190-206. doi:10.1037/a0013153 

Ladd, H. F. (2012). Education and poverty: Confronting the evidence. 
Journal of Policy Analysis and Management, 31, 203-227. doi:10.1002/ 
pam.21615 

Laird, J., Cataldi, E. F., KewalRamani, A., & Chapman, C. (2008). 
Dropout and completion rates in the United States: 2006 (Report No. 
2008-053). Retrieved from http://nces.edu.gov/pubsearch/pubsinfo 
.asp?pubid=2008053 

Legot, C., London, B., Rosofsky, A., & Shandra, J. (2012). Proximity to 
industrial toxins and childhood respiratory, developmental, and neuro- 
logical diseases: Environmental ascription in East Baton Rouge Parish, 
Louisiana. Population and Environment, 33, 333-346. doi:10.1007/ 
s11111-011-0147-z 

Li, S. H., Arguelles, L., Jiang, F., Chen, W. J., Jin, X. M., Yan, C. H., . . . 
Shen, X. M. (2013). Sleep, school performance, and a school-based 
intervention among school-aged children: A sleep series study in China. 
PLOS One, 8(7), e67928. doi:10.1371/journal.pone.0067928 

Liaw, F., & Brooks-Gunn, J. (1994). Cumulative familial risks and low 
birth weight children’s cognitive and behavioral development. Jour- 
nal of Clinical Child Psychology, 23, 360-372. doi:10.1207/ 
s15374424jccp2304_2 

Liu, R. X., & Chen, Z. (2006). The effects of marital conflict and marital 
disruption on depressive affect: A comparison between women in and 
out of poverty. Social Science Quarterly, 87, 250-271. doi:10.1111/j 
.1540-6237.2006.00379.x 

Lufi, D., Tzischinsky, O., & Hadar, S. (2011). Delaying school starting 
time by one hour: Some effects on attention levels in adolescents. 
Journal of Clinical Sleep Medicine, 7, 137-143. 

Lynch, J. W., Kaplan, G. A., & Shema, S. J. (1997). Cumulative impact of 
sustained economic hardship on physical, cognitive, psychological, and 
social functioning. The New England Journal of Medicine, 337, 1889- 
1895. doi:10.1056/NEJM199712253372606 

Milam, A. J., Furr-Holden, C. D. M., & Leaf, P. J. (2010). Perceived school 
and neighborhood safety, neighborhood violence, and academic achieve- 
ment in urban school children. The Urban Review, 42, 458-467. doi: 
10.1007/s11256-010-0165-7 


Myers, D., Baer, W., & Choi, S. (1996). The changing problems of 
overcrowded housing. Journal of the American Planning Association, 
62, 66-84. doi:10.1080/01944369608975671 

National Center for Education Statistics. (2000). Condition of America’s 
public school facilities: 1999 (Report No. 2000-032). Washington, DC: 
U.S. Department of Education. 

National Sleep Foundation. (2005a). Changing school start times: Fayette 
County, Kentucky. Retrieved from http://www.sleepinfairfax.org/docs/ 
CS.Fayette.pdf 

National Sleep Foundation. (2005b). Changing school start times: Jessa- 
mine County, Kentucky. Retrieved from http://www.sleepinfairfax.org/ 
docs/CS.Jessamine.pdf 

National Sleep Foundation. (2011). 2011 Sleep in America poll: Commu- 
nications technology in the bedroom. Retrieved from http:// 
teensneedsleep.files.wordpress.com/2011/05/national-sleep-foundation- 
2011-sleep-in-america-poll-communications-technology-in-the- 
bedroom.pdf 

Owens, J. A., Belon, K., & Moss, P. (2010). Impact of delaying school start 
time on adolescent sleep, mood, and behavior. Archives of Pediatric and 
Adolescent Medicine, 164, 608-614. doi:10.1001/archpediatrics 
.2010.96 

Owens, J., Maxim, R., McGuinn, M., Nobile, C., Msall, M., & Alario, A. 
(1999). Television-viewing habits and sleep disturbance in school chil- 
dren. Pediatrics, 104(3), e27. doi:10.1542/peds.104.3.e27 

Paavonen, E. J., Aronen, E. T., Moilanen, I., Piha, J., Räsänen, E., Tam- 
minen, T., & Almqvist, F. (2000). Sleep problems of school-aged 
children: A complementary view. Acta Paediatrica, 89, 223-228. doi: 
10.1111/j.1651-2227.2000.tb01220.x 

Preacher, K. J., Curran, P. J., & Bauer, D. J. (2006). Computational tools 
for probing interaction effects in multiple linear regression, multilevel 
modeling, and latent curve analysis. Journal of Educational and Behav- 
ioral Statistics, 31, 437-448. doi:10.3102/10769986031004437 

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: 
Applications and data analysis (2nd ed.). Thousand Oaks, CA: Sage. 

Sadeh, A., Gruber, R., & Raviv, A. (2003). The effects of sleep restriction 
and extension on school-age children: What a difference an hour makes. 
Child Development, 74, 444-455. doi:10.1111/1467-8624.7402008 

Sameroff, A. J., Bartko, W. T., Baldwin, A., Baldwin, C., & Seifer, R. 
(1998). Family and social influences on the development of child com- 
petence. In M. Lewis & C. Feiring (Eds.), Families, risk, and compe- 
tence (pp. 161-183). Mahwah, NJ: Erlbaum. 

Sameroff, A. J., Seifer, R., Barocas, R., Zax, M., & Greenspan, S. (1987). 
Intelligence quotient scores of 4-year-old children: Social environmental 
risk factors. Pediatrics, 79, 343-350. 

Shaw, T. C., De Young, A. J., & Rademacher, E. W. (2004). Educational 
attainment in Appalachia: Growing with the nation, but challenges 
remain. Journal of Appalachian Studies, 10, 307-329. 

Shen, J., Leslie, J. M., Spybrook, J. K., & Ma, X. (2012). Are principal 
background and school processes related to teacher job satisfaction? A 
multilevel study using schools and staffing survey 2003-04. American 
Educational Research Journal, 49, 200-230. doi:10.3102/ 
0002831211419949 

Wahlstrom, K. L. (2002). Accommodating the sleep patterns of adolescents 
within current educational structures: An uncharted path. In M. A. 
Carskadon (Ed.), Adolescent sleep patterns: Biological, social, and 
psychological influences (pp. 172-197). New York, NY: Cambridge 
University Press. 

Wenglinsky, H. (2002). The link between teacher classroom practices and 
student academic performance. Education Policy Analysis Archives, 
10(12), 1-30. 

Wilson, S., & Gore, J. (2009). Appalachian origin moderates the associa- 
tion between school connectedness and GPA: Two exploratory studies. 
Journal of Appalachian Studies, 15, 70-86. 

Wolfson, A. R., & Carskadon, M. A. (1998). Sleep schedules and daytime 
functioning in adolescents. Child Development, 69, 875-887. doi: 
10.1111/j.1467-8624.1998.tb06149.x 

Wolfson, A. R., Spaulding, N. L., Dandrow, C., & Baroni, E. M. (2007). 
Middle school start times: The importance of a good night’s sleep for 
young adolescents. Behavioral Sleep Medicine, 5, 194-209. doi: 
10.1080/15402000701263809 

Ziliak, J. P. (2012). The Appalachian Regional Development Act and 
economic change. In J. P. Ziliak (Ed.), Appalachian legacy: Economic 
opportunity after the war on poverty (pp. 19-44). Washington, DC: 
Brookings Institution Press. 


Received November 1, 2013 
Revision received April 4, 2014 
Accepted April 9, 2014 


Journal of Educational Psychology 
2015, Vol. 107, No. 1, 246-257 


© 2014 American Psychological Association 
0022-0663/15/$12.00 http://dx.doi.org/10.1037/a0037389 


Developmental Dynamics Between Children’s Externalizing Problems, 
Task-Avoidant Behavior, and Academic Performance in Early School 
Years: A 4-Year Follow-Up 


Riitta-Leena Metsäpelto, Eija Pakarinen, Noona Kiuru, Anna-Maija Poikkeus, 
Marja-Kristiina Lerkkanen, and Jari-Erik Nurmi 

This longitudinal study investigated the associations among children’s externalizing problems, task- 
avoidant behavior, and academic performance in early school years. The participants were 586 children 
(43% girls, 57% boys). Data pertaining to externalizing problems (teacher ratings) and task-avoidant 
behaviors (mother and teacher ratings) were gathered, and the children were tested yearly on their 
academic performance in Grades 1-4. The results were similar for both genders. The analyses supported 
a mediation model: high externalizing problems in Grades 1 and 2 were linked with low academic 
performance in Grades 3 and 4 through increases in task-avoidant behavior in Grades 2 and 3. The results 
also provided evidence for a reversed mediator model: low academic performance in Grades 1 and 2 was 
associated with high externalizing problems in Grades 3 and 4 via high task avoidance in Grades 2 and 
3. These findings emphasize the need to examine externalizing problems, task-avoidant behavior, and 
academic performance conjointly to understand their developmental dynamics in early school years. 


Keywords: externalizing problems, task-avoidant behavior, academic performance, longitudinal study, 
cross-lagged associations 


Externalizing problems and maladaptive achievement behaviors 
constitute major problems in primary school and compromise 
students’ learning outcomes and adjustment at school. Previous 
research has shown that externalizing problems are linked to low 
reading and math attainments (Adams, Snowling, Hennessy, & 
Kind, 1999), lower cognitive abilities and academic achievement 
(Bub, McCartney, & Willett, 2007), a higher incidence of repeat- 
ing a class, and a diminished probability of graduating from high 
school and attending college (McLeod & Kaiser, 2004). Likewise, 
maladaptive achievement behavior, as indicated by avoidance of 
learning tasks and adoption of strategies that interfere with learn- 
ing (e.g., procrastination), has been found to predict subsequent 
poor academic performance (Aunola, Nurmi, Niemi, Lerkkanen, & 
Rasku-Puttonen, 2002; Mägi, Haidkind, & Kikas, 2010; Midgley 
& Urdan, 1995). 

In this study, we drew together and extended previous work on 
these two lines of research by investigating the cross-lagged asso- 
ciations between externalizing problems, task-avoidant behavior, 
and academic performance, using multiwave panel data. Prior 
research has provided only scant evidence of the linkages between 
externalizing problems and task-avoidant behavior, and little is 
known about how they might combine to contribute to children’s 
academic performance. Our first question concerned the extent to 
which externalizing problems influence children’s achievement 
via task-avoidant behavior. The goal was to increase understanding 
of the mechanisms through which problem behaviors and achieve- 
ment strategies in the learning contexts are intertwined to affect the 
academic outcomes of children in the beginning of their school 
career. Our second question concerned reversed mediator effects, 
that is, the extent to which low academic performance increases 
externalizing problems via task-avoidant behavior. Evidence has 
linked academic difficulties to increases in externalizing problems, 
but studies investigating the mediating mechanisms are rare. 
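
The logic of the first question can be illustrated with a small simulation. The following is only a minimal sketch, assuming a simple product-of-coefficients approach with ordinary least squares on simulated data; it is not the analysis used in this study. The function names, variable names, sample size, and effect sizes (0.3 for the path from externalizing problems to later task avoidance, -0.4 for the path from task avoidance to later performance) are all hypothetical.

```python
import random


def ols_two_predictors(y, x1, x2):
    """Slopes (b1, b2) from regressing y on x1 and x2 with an intercept."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 normal equations directly.
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate(n=20000, seed=1):
    """Hypothetical three-wave data; all coefficients are illustrative."""
    rng = random.Random(seed)
    ext1 = [rng.gauss(0, 1) for _ in range(n)]            # externalizing, wave 1
    avoid1 = [0.4 * e + rng.gauss(0, 1) for e in ext1]    # task avoidance, wave 1
    avoid2 = [0.5 * m + 0.3 * e + rng.gauss(0, 1)         # task avoidance, wave 2
              for m, e in zip(avoid1, ext1)]
    perf3 = [-0.4 * m - 0.1 * e + rng.gauss(0, 1)         # performance, wave 3
             for m, e in zip(avoid2, ext1)]
    return ext1, avoid1, avoid2, perf3


def estimate_indirect_effect(ext1, avoid1, avoid2, perf3):
    # Path a: externalizing -> later task avoidance, controlling earlier avoidance.
    a, _ = ols_two_predictors(avoid2, ext1, avoid1)
    # Path b: task avoidance -> later performance, controlling externalizing.
    b, _ = ols_two_predictors(perf3, avoid2, ext1)
    return a, b, a * b  # the indirect (mediated) effect is the product a * b


if __name__ == "__main__":
    a, b, indirect = estimate_indirect_effect(*simulate())
    print(f"a = {a:.3f}, b = {b:.3f}, indirect = {indirect:.3f}")
```

Controlling for the earlier level of the mediator is what lets path a be read as a change in task avoidance rather than a pre-existing association. The reversed mediator question would swap the roles of externalizing problems and academic performance in the same scheme.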


Externalizing Problems and Academic Performance 


Externalizing problems refer to a broad category of disruptive 
behaviors, such as aggressiveness, oppositional behavior, conduct 
problems, hyperactivity, and attention deficit problems (Mc- 
Mahon, 1994). Early signs of externalizing problems, manifested 
in substantial noncompliance, aggression toward peers, high ac- 
tivity level, and poor regulation of impulses, can already be 
identified in the toddler and preschool years (Campbell, Shaw, & 
Gilliom, 2000). Although the average levels of externalizing prob- 
lems tend to decrease from preschool age to young adulthood 
(Bongers, Koot, van der Ende, & Verhulst, 2003), individual 
differences in externalizing problems persist at moderate levels 
from early to middle childhood (Deater-Deckard, Dodge, Bates, & 
Pettit, 1998) and through adolescence (Broidy et al., 2003). Research 
on factors predisposing children to externalizing problems 
implicates multiple risk factors (Deater-Deckard et al., 1998). 
They include individual factors such as deficits in self-regulation 
(Olson et al., 2011) and difficult temperament (Miller-Lewis et al., 
2006) or contextual factors such as poverty (Grant et al., 2003), 
parental use of harsh and punitive discipline (Olson et al., 2011), 
and rejection by peers (Laird, Jordan, Dodge, Pettit, & Bates, 
2001). Externalizing behaviors are also more prevalent among 
boys than girls in early and middle childhood (Bongers et al., 
2003; Leadbeater, Kuperminc, Blatt, & Herzog, 1999). 

Externalizing problems in early childhood and school age have 
been shown to predict various adverse mental health and psycho- 
social outcomes (Caspi, 2000; Fergusson, Horwood, & Ridder, 
2007). In addition, children with externalizing problems often fail 
to take advantage of learning opportunities in the classroom. For 
instance, children with symptoms or diagnosis of attention-deficit/ 
hyperactivity disorder are much more likely to exhibit academic 
impairments in reading, writing, and mathematics than children 
without such symptoms (Barry, Lyman, & Klinger, 2002; Mc- 
Conaughy, Volpe, Antshel, Gordon, & Eiraldi, 2011; Spira & 
Fischel, 2005). Conduct and oppositional defiant disorders also 
co-occur with learning difficulties and academic underachieve- 
ment, although these associations are less consistent and at least 
partly accounted for by linkages with attention-deficit/hyperactiv- 
ity disorder (Frick et al., 1991; Hinshaw & Lee, 2003). Literature 
on the broadband construct of externalizing problems shows that 
children and adolescents with externalizing problems exhibit def- 
icits both in general academic competence (Burt & Roisman, 2010; 
Masten et al., 2005) and in the development of more specific skills, 
such as reading, writing, and math (Adams et al., 1999; Gresham, 
Lane, MacMillan, & Bocian, 1999; Hinshaw, 1992; Nelson, 
Benner, Lane, & Smith, 2004). Academic difficulties may result 
from disruptive behavior, leading students to overlook vital infor- 
mation and fail to follow teachers’ instructions (Atkins, McKay, 
Talbott, & Arvanitis, 1996). The negative dynamic between be- 
havior and achievement may also manifest as avoidance of tasks or 
assignments in the classroom. Furthermore, children with exter- 
nalizing problems have more conflicts with teachers and more 
negative attitudes in teacher-student relationships than children 
without behavioral difficulties (Henricsson & Rydell, 2004). 

The association between externalizing problems and academic 
outcomes may also run in the opposite direction, that is, academic 
achievement affecting externalizing behavior. Masten et al. (2005) 
pointed out that the emergence of emotional and behavioral prob- 
lems is related to the failure to accomplish age-salient develop- 
mental tasks, such as integration into school and successful acqui- 
sition of knowledge and skills. Through normative comparisons 
with their classmates, children become more aware of their aca- 
demic progress and standing with respect to abilities (Sutherland, 
Lewis-Palmer, Stichter, & Morgan, 2008). Students for whom 
schoolwork is difficult develop negative self-perceptions of ability 
(Chapman, 1988), and they may feel embarrassment, frustration, 
and general antagonism toward school, which, in turn, may set in 
motion defiance and aggressive behavior (Miles & Stipek, 2006; 
Roeser, Eccles, & Strobel, 1998). Accordingly, Halonen, Aunola, 
Ahonen, and Nurmi (2006) found that problems in learning to read 
predicted an increase in externalizing problem behavior during the 
first 2 years of primary school. Moreover, McGee, Williams, 
Share, Anderson, and Silva (1986) followed a group of boys from 
ages 5 to 11 years and found evidence of reciprocal relations, with 
behavior problems predicting reading disability and reading diffi- 
culties further aggravating the existing problem behaviors. These 
kinds of reciprocal effects have also been found among older 
students. In their follow-up of children from fifth to ninth grade, 
Zimmermann, Schütte, Taskinen, and Köller (2013) found evi- 
dence of an escalating cycle of negative outcomes in that high 
externalizing problems predicted lower school grades in reading 
and language skills and math that, in turn, contributed to increased 
externalizing problems both directly and via lowered self-esteem. 


Task-Avoidant Behavior and Academic Performance 


In addition to externalizing problems, another frequent concern 
in schools is students’ failure to develop academic skills and 
knowledge due to a tendency to avoid challenging tasks instead of 
actively attempting to perform them. This maladaptive achieve- 
ment strategy has been described using various concepts such as 
self-handicapping (Jones & Berglas, 1978), helplessness (Dweck 
& Leggett, 1988), and task-avoidant behavior (Nurmi, Aunola, 
Salmela-Aro, & Lindroos, 2003; Zhang, Nurmi, Kiuru, Lerkkanen, 
& Aunola, 2011). These concepts share the key idea that failures 
in learning situations create a negative self-concept and low effi- 
cacy beliefs, which increase the likelihood of developing expec- 
tations of future failure, leading to low effort and task avoidance in 
learning settings (Aunola et al., 2002; Sideridis, 2003). The avoid- 
ance of learning tasks has been found to be more common among 
boys than girls (Midgley & Urdan, 1995; Onatsu-Arvilommi & 
Nurmi, 2000; Pakarinen et al., 2011). 

Task avoidance has been linked to a host of negative conse- 
quences, such as slow development in reading and math in early 
school years (Aunola et al., 2002; Georgiou, Manolitsis, Nurmi, & 
Parrila, 2010; Hirvonen, Tolvanen, Aunola, & Nurmi, 2012; Mägi 
et al., 2010), learning difficulties (Sideridis, 2003), and low aca- 
demic performance in young adulthood (Zuckerman, Kieffer, & 
Knee, 1998). Conversely, children’s ability to focus on tasks, 
sustain effort, and persist in the face of difficulties has been found 
to predict better achievement outcomes (Duncan et al., 2007; 
Hughes, Luo, Kwok, & Loyd, 2008). 

The evidence also suggests the opposite predictive path where 
learning difficulties and slow academic progress predict decreas- 
ing task involvement and high avoidance behavior in early school 
years (Aunola et al., 2002) or already around the transition to 
primary school (Lepola, Salonen, & Vauras, 2000; Pakarinen et al., 
2011). Onatsu-Arvilommi and Nurmi (2000), for instance, re- 
ported cumulative cycles in which learning difficulties and task- 
avoidant behaviors reciprocally influence each other by showing 
that first graders’ tendency to avoid learning tasks decreased their 
subsequent progress in reading skills. Further, a low level of 
literacy skills increased their subsequent task-avoidant behaviors. 


Developmental Links Among Externalizing Problems, 
Task-Avoidance, and Academic Performance 


Externalizing problems and task-avoidant behavior have typi- 
cally been investigated separately, and little is known about how 
these risk behaviors co-vary and operate in conjunction with each 
other to predict academic achievement. In the current research, we 
aimed to draw these two approaches together by investigating the 
cross-lagged associations between externalizing problems and 
task-avoidant behavior, and their associations with academic per- 
formance in the early school years. Developmental dynamics of 
this kind can only be detected using longitudinal data that allow 
the controlling of pre-existing and concurrent associations, so that 
the influences from one domain of functioning to another can be 
evaluated from a developmental perspective (e.g., Bornstein, 
Hahn, & Haynes, 2010; Masten et al., 2005). 

The sparse available research supports the view that externaliz- 
ing problems co-occur with low involvement in learning tasks 
(Arnold, 1997; Coie & Dodge, 1988) and school engagement (Risi, 
Gerhardstein, & Kistner, 2003; Wagner & Cameto, 2004), signif- 
icantly increasing the risk for learning difficulties and school 
dropout. The specific interest in the present study was in examin- 
ing a mediation model wherein externalizing problems are linked 
with low academic performance through increases in task-avoidant 
behavior. There are at least two mechanisms through which exter- 
nalizing problems may bring out task-avoidant behavior. First, 
externalizing problems are known to increase negative feedback 
and conflictual interactions with teachers (Henricsson & Rydell, 
2004; Ladd, Birch, & Buhs, 1999; Murray & Murray, 2004; 
Nurmi, 2012; Stipek & Miles, 2008). The accumulation of unre- 
warding experiences in classroom interactions and learning situa- 
tions may generate low competence beliefs, failure expectations, 
and feelings of animosity toward school that, in turn, lead to low 
inclination to exert effort in academic work. The links between 
externalizing symptoms and low task focus have been indicated, 
for instance, by findings of Coie and Dodge (1988), documenting 
that aggressive first- and third-grade children were more likely to 
spend time “off task” in the classroom than children with average 
peer status, and by those of Arnold (1997) showing that the 
aggressive, hostile, and noncompliant behavior of 4- to 6-year-old 
boys was associated with low on-task behavior. 

Second, externalizing problems are often accompanied by dif- 
ficulties in attending to and complying with teachers’ instructions 
in learning tasks. Prior research has underlined the central role of 
early self-regulatory mechanisms, such as effortful control and 
behavioral inhibition, in the development and persistence of ex- 
ternalizing behavior disorders (Olson, Sameroff, Kerr, Lopez, & 
Wellman, 2005; Olson et al., 2011). Low skills in controlling 
attention and behavior (high distractibility) have been shown to 
predict young children’s active task avoidance at school (Hir- 
vonen, Aunola, Alatupa, Viljaranta, & Nurmi, 2013). Thus, defi- 
cits in attention and self-regulation of behavior—the key features 
of externalizing problems—can be expected to lead to low persis- 
tence and completion of tasks in learning settings. 

As outlined previously, both task-avoidant behavior and exter- 
nalizing problems disrupt classroom learning processes because 
they interfere with the child’s ability to direct and sustain attention 
on academic activities and work in a self-regulated fashion. Task- 
avoidant behavior represents a maladaptive achievement strategy 
that students use to cope with situational demands and stress when 
confronted with challenging learning tasks (Nurmi et al., 2003; 
Zhang et al., 2011). Task avoidance is assumed to be gradually 
built over time based on a history of academic difficulties and 
ensuing negative self-perceptions, which lead to expectations of 
subsequent failure and anxiety in new learning situations. To cope 
with such expectations and feelings, students use task-avoidant 
behavior either to decrease anxiousness (Miller, 1987) or to create 
an excuse (Jones & Berglas, 1978). A persistent pattern of task 
avoidance is likely to decrease the time that a child spends in 
effective academic endeavors and affects his or her choices of 
learning tasks. In contrast, externalizing problems are not re- 
stricted specifically to learning tasks but contain a larger set of 
negative reactions or out-of-bounds behavior expressed in several 
kinds of environments (i.e., in academic tasks as well as interper- 
sonal relationships with teachers and peers; McMahon, 1994). 
When examined simultaneously, task avoidance and externalizing 
problems appear to have differential associations with academic 
outcome measures. Using six longitudinal data sets at school entry, 
Duncan et al. (2007; see also Morgan, Farkas, Tufis, & Sperling, 
2008) showed that a child’s ability to control and sustain attention 
and participate in classroom activities—not externalizing prob- 
lems—predicted later reading and math skills. These findings 
challenge the body of evidence showing linkages between exter- 
nalizing problems and academic underachievement (Adams et al., 
1999; Hinshaw, 1992; Masten et al., 2005). Therefore, research
allowing analysis of mediator effects is needed for researchers to
gain understanding of the interplay between externalizing problems
and task avoidance in predicting academic achievement.

The present study tests the assumption that externalizing problems
(i.e., poor conduct, hyperactivity, and inattentiveness) affect
the child’s achievement strategies by increasing task-avoidant
behavior, which in turn leads to poor academic performance. While
some of the links that we examine have been documented in prior 
studies, the designs have typically allowed for investigating only 
two of the three critical measures without integrating all of them 
into the same study. A design looking at the mediating paths 
conjointly allowed us to obtain a more comprehensive picture of 
how behavioral problems interfere with young students’ daily 
functioning in school and compromise their ability to benefit 
optimally from the learning opportunities. The examination of 
mediated effects is valuable because more efficient interventions 
focusing on the relevant variables in the mediating process can be 
developed and put into action (MacKinnon & Fairchild, 2009). 
Prior research on the links between externalizing problems and 
task-avoidant behavior has also relied on cross-sectional data 
(Coie & Dodge, 1988) or samples of very young children from 
low-income families (Arnold, 1997). The present study addresses 
the need to examine how different developmental domains (e.g., 
behavioral problems, achievement strategies, and academic 
achievement) influence each other over longer periods as children 
move through the school system (Roeser et al., 1998). 

In regard to the possibility of mediated paths from academic 
performance via task avoidance to externalizing problems over 
time, prior research indicates that early learning problems often set 
in motion negative developmental cycles (e.g., Aunola, Leskinen, 
Lerkkanen, & Nurmi, 2004). Consistent associations have been 
documented between learning difficulties and maladaptive 
achievement strategies, such as task-avoidant behavior (Halonen et 
al., 2006; Lepola et al., 2000; McGee et al., 1986; Onatsu- 
Arvilommi & Nurmi, 2000; Pakarinen et al., 2011). On the basis of 
these findings, it is plausible that early academic difficulties are 
related to subsequent task-avoidant behavior, which in turn in- 
creases externalizing problems. This association has rarely been 
tested with the exception of the study by Arnold (1997), which 
provided evidence of the negative escalating cycle in preschool- 
age high-risk boys, showing that low academic skills explained 
low on-task behavior, which in turn increased externalizing problems.


The Present Study 


The purpose of this study was to examine cross-lagged associ- 
ations between externalizing problems, task-avoidant behavior, 
and academic performance in the early primary school years. We 
had the following aims and hypotheses: 

1. To what extent do externalizing problems predict academic 
performance, and is this association mediated by task-avoidant 
behavior? We assumed that high externalizing problems would 
predict low academic performance (Hypothesis 1a) and that this 
association would partly be mediated via high task-avoidant be- 
havior (Hypothesis 1b). 

2. To what extent does academic performance predict external- 
izing problems, and does task-avoidant behavior mediate this 
association? We hypothesized that low academic performance 
would predict high externalizing problems (Hypothesis 2a) and 
that high task-avoidant behavior would partly contribute to this 
association (Hypothesis 2b). 


Method 


Participants and Procedure 


Participants in the present study were part of an extensive
follow-up (Lerkkanen et al., 2006) in which a community sample
of Finnish-speaking children (N = 1,880) was followed from
kindergarten to the end of Grade 4. The follow-up took place in
four towns—two in Central Finland, one in Western Finland, and 
one in Eastern Finland. The children’s average age was 85.82 
months (SD = 3.45 months) at school entry. At the beginning of 
the study, the children’s parents and teachers were asked for their 
written consent. 

The study sample consisted of a more intensively followed 
subsample of 586 children (43% girls and 57% boys) drawn from 
the original sample of 1,880 children. This subsample consisted of 
children identified with risk for reading difficulties at the end of 
kindergarten (n = 282) and randomly selected control children 
from the same classrooms (n = 304). Risk for reading difficulty 
(RD; for details, see Lerkkanen, Ahonen, & Poikkeus, 2011) was 
determined on the basis of kindergarten assessment for pre-reading 
skills (i.e., letter knowledge, phonemic awareness, and rapid au- 
tomatized naming) and information on the parents’ reading diffi- 
culties, indicated by either the mother or father self-reporting 
“mild” or “severe” problems with reading at school age. A child 
was identified as having a risk for RD if he or she scored at or 
below the 15th percentile of the total sample in at least two of the 
measured skill areas, or if the child scored at or below the 15th 
percentile in one of the skill areas and the parental questionnaire 
indicated a family risk. It should be noted that the screening took 
place at an age at which the children had not received formal 
reading instruction, and it did not include early decoding skills, 
phoneme blending, or manipulation skills. Since the criterion for 
risk was set to be lenient, we would expect that only a subgroup of 
children identified with this early risk would develop difficulties in 
reading acquisition. From the other participants of the follow-up 
(n = 1,690), a random sample of nonrisk children who did not 
meet the risk criteria was also included in the individual
follow-up assessment from the first grade onward (n = 258). The 
random selection of the nonrisk sample was carried out from 
classrooms in a stratified fashion. Due to variation in class size, the 
number of nonrisk children from different classrooms ranged 
between one and six, with a median of three children. Target 
sampling of children for the individual follow-up was necessary to 
ensure that the data collection demands placed on the teachers 
were not too heavy. 
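The screening rule described above can be written as a small decision function. This is only an illustrative sketch: the function name, the dictionary keys, and the percentile inputs are hypothetical, and only the 15th-percentile cutoff and the two-part rule are taken from the text.

```python
from typing import Mapping

RISK_PERCENTILE = 15.0  # cutoff stated in the text


def has_rd_risk(skill_percentiles: Mapping[str, float], family_risk: bool) -> bool:
    """Return True if a child meets the screening rule for risk of reading difficulties.

    skill_percentiles maps each measured pre-reading skill area (hypothetical keys)
    to the child's percentile rank in the total sample.
    """
    # Count the skill areas in which the child scored at or below the cutoff.
    low_areas = sum(1 for p in skill_percentiles.values() if p <= RISK_PERCENTILE)
    # Two or more low areas, or one low area combined with parental reading difficulties.
    return low_areas >= 2 or (low_areas >= 1 and family_risk)
```

For example, a child at the 10th percentile in letter knowledge only would be flagged when a parent reported reading difficulties, but not otherwise.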

The sample was representative of the Finnish population (Sta- 
tistics Finland, 2007). In 7% (general population 6%) of the 
families, the parents had not been educated beyond comprehensive 
school (compulsory education up to the completion of Grade 9); 
31% (general population 30%) had completed upper secondary 
education (senior high school or vocational school, Grades 10-12);
36% (general population 35%) had a bachelor’s degree or
vocational college degree (3-year education at a college or univer- 
sity); and 26% (general population 29%) had a master’s degree 
(5-year university education) or higher. 

The data on the students’ externalizing problems (teacher rat- 
ings) and task-avoidant behaviors (mother and teacher ratings) 
were gathered in Grade 1 (April 2008; T1), Grade 2 (April 2009; 
T2), Grade 3 (April 2010; T3), and Grade 4 (April 2011; T4). The 
children were tested on their academic performance in Grade 1 
(April 2008; T1), Grade 2 (April 2009; T2), Grade 3 (April 2010; 
T3), and Grade 4 (April 2011; T4). 


Measures 


Task-avoidant behavior. Task-avoidant behavior was assessed
in Grades 1-4 by asking the mothers and teachers to
evaluate the extent of the child’s task-avoidant behavior using the 
Behavior Strategy Rating Scale (BSR; Onatsu-Arvilommi & 
Nurmi, 2000) rated on a 5-point scale (1 = not at all; 5 = to a 
great extent). The mothers were asked to evaluate their child’s 
behavior in typical homework situations, and teachers were asked 
to assess their students’ typical behavior in learning situations at 
school. The combination of mother and teacher ratings provided an 
assessment of task-avoidant behavior across a variety of learning 
situations, both at school and at home. The following five items 
were used: 

(a) Does the child have a tendency to find something else to do 
instead of focusing on the task at hand? 

(b) If the activity or task is not going well, does the child lose 
his or her focus? 

(c) Does the child give up on tasks easily? 

(d) Does the child actively attempt to solve even difficult 
situations and tasks? (reversed) 

(e) Does the child demonstrate initiative and persistence in his
or her activities and tasks? (reversed)

The correlations between mother and teacher reports on task
avoidance have been found to range from .36 (Grade 1) to .48
(Grade 2), indicating moderate convergent validity (see Zhang et
al., 2011). A composite score for task-avoidant behavior for each
grade was calculated as a mean of mother’s and teacher’s items.
The Cronbach’s alphas for the mean scores were .89, .91, .86, and
.89 in Grades 1-4, respectively.
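As an illustration of the scoring just described, the sketch below recodes the two reversed items, averages the mother and teacher ratings into a composite, and computes Cronbach's alpha with the standard formula. The function names and the example ratings are hypothetical; the handling of missing ratings in the actual study is not described here and is not modeled.

```python
from statistics import mean, variance

REVERSED_ITEMS = (3, 4)        # zero-based positions of items (d) and (e)
SCALE_MIN, SCALE_MAX = 1, 5    # 5-point rating scale


def reverse_code(ratings):
    """Flip the reversed items so that higher values always mean more avoidance."""
    return [
        SCALE_MIN + SCALE_MAX - r if i in REVERSED_ITEMS else r
        for i, r in enumerate(ratings)
    ]


def task_avoidance_composite(mother_ratings, teacher_ratings):
    """Mean of the ten recoded mother and teacher item ratings for one child."""
    return mean(reverse_code(mother_ratings) + reverse_code(teacher_ratings))


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-child rows of (already recoded) items."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

A child rated 1 on items (a) to (c) and 5 on the reversed items (d) and (e) by both informants receives the lowest possible composite score of 1.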

Externalizing problems. Externalizing problems were assessed
by teacher ratings in Grades 1-4 using a Finnish version of
the Strengths and Difficulties Questionnaire (SDQ; Goodman,
1997), which has been shown to be a highly valid screening 
instrument (Goodman, Ford, Simmons, Gatward, & Meltzer, 
2000). The SDQ consists of 25 items rated on a 3-point scale (1 = 
not true, 2 = somewhat true, and 3 = certainly true), producing 
scales for hyperactivity/inattention, conduct problems, emotional 
symptoms, peer problems, and prosociality. We used the scales for 
hyperactivity/inattention (five items; e.g., restless, overactive, cannot
stay still for long, thinks things out before acting [reversed])
and conduct problems (five items; e.g., often fights with other
children or bullies them, generally obedient, usually does what 
adults request [reversed]) to measure externalizing problems. One 
item, “sees tasks through to the end, good attention span (re- 
versed),” was excluded from the Externalizing Problems Scale 
because it correlated highest with task-avoidance items, and in the 
exploratory factor analyses, it had moderate loadings on both the 
hyperactivity/inattention and task-avoidance factors. The composite
score for externalizing problems in Grades 1-4 (Time 1-Time
4: T1-T4) was formed as a mean score of the hyperactivity/inattention
and conduct problems scales. The Cronbach’s alpha
reliabilities for the Externalizing Problems composite were .82,
.84, .84, and .78 in Grades 1-4, respectively.

Academic performance. Academic performance in Grades 
1-4 was assessed using an aggregate constructed on the basis of 
the children’s scores on reading and arithmetic tasks. 

Reading skills (decoding and comprehension) were measured 
using a nationally standardized test battery developed for students 
between Grade 1 and Grade 6 (ALLU [Reading test for primary 
school]; Lindeman, 1998). The decoding test was a speeded test in 
which a maximum of 80 items could be attempted within a 2-min 
time limit. For each item, there was a picture with four words next 
to it. The child was asked to read the four phonologically similar 
words and to draw a line to semantically match the picture to the 
word. The score was derived by calculating the number of correct 
answers (maximum score 80). Because of the time limit, the score 
reflects both the child’s fluency in reading the stimulus words and 
his or her accuracy in making the correct choice from among the 
alternatives. Differentiation between children’s rate of reading 
acquisition in the highly transparent Finnish language requires a 
speeded test already at the end of Grade 1 because approximately 
a third of children learn to decode before entering school and tests 
of reading accuracy without a time limit reach a ceiling very fast 
(see Lerkkanen, Rasku-Puttonen, Aunola, & Nurmi, 2004). The 
parallel versions of the test, A and B, were used on alternate years. 
Alternate-form reliability between Forms A and B was .84. No 
ceiling effect was evident in Grade 4. The Kuder-Richardson
reliability coefficient for the decoding fluency task in Grades 1-4
was .97, .97, .97, and .87, respectively.

The reading comprehension test assessed the child’s skills in 
gleaning factual knowledge, concepts, and inferences from text. 
The children were asked to answer 12 multiple-choice questions 
based on silently read text. The children received 1 point for each 
correct answer (maximum score 12). They completed the task at 
their own pace, but the maximum time allotted was 45 min. This 
normed test included different texts and multiple choice questions 
for Grade 1 through Grade 4 so that task difficulty was optimal for 
each age. The topics of texts were the following: “Judo” (Grade 1), 
“Guidelines for Gymnastics” (Grade 2), “Operating a Camera” 
(Grade 3), and “The Light Requirements of Plants” (Grade 4). No
ceiling effect was evident in Grade 4. The Kuder-Richardson
reliability coefficient for the reading comprehension task in Grades
1-4 was .85, .80, .75, and .76, respectively.

Arithmetic skills were assessed using a group-administered Basic
Arithmetic Test (BAT; Aunola & Räsänen, 2007), which was
designed for Grades 1-6. It contains visually presented addition,
subtraction, and multiplication problems (total of 28 items) which
become more difficult across primary school years. The test is
speeded (3-min time limit), and because of this, it remains challenging
even for the oldest children. In first grade, the test
consisted of 14 addition items (e.g., 2 + 1 = ? and 3 + 4 +
6 = ?) and 14 subtraction items (e.g., 4 - 1 = ? and 20 - 2 -
4 = ?). The items remained the same until fourth grade,
after which some of the easiest tasks were replaced with more
challenging ones toward the end of the test (e.g., multiplication
problems). The test indexes a combination of speed and accuracy
of math performance, and its psychometric properties have been documented in a
number of earlier publications (e.g., Niemi et al., 2011; Zhang,
Koponen, Räsänen, Aunola, Lerkkanen, & Nurmi, 2014). The test
score was derived by calculating the total number of correct
answers (maximum score 28). The Kuder-Richardson reliability
coefficient for the task in Grades 1-4 was .84, .86, .87, and .85,
respectively.

To calculate the composite score for Academic Performance in 
Grades 1-4 (T1-T4), the children’s test scores on the reading 
(decoding fluency and reading comprehension) and arithmetic 
were standardized and their mean score was calculated. The Cron- 
bach’s alpha for the Academic Performance composite score was 
.76, .69, .60, and .66, in Grades 1-4, respectively. 
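The composite described above can be sketched as follows. This is a minimal illustration that assumes standardization is done within one grade using the sample mean and standard deviation; the function names and the example scores are hypothetical, and cases with missing test scores are not handled.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize raw test scores to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def academic_composite(decoding, comprehension, arithmetic):
    """Per-child mean of the three standardized test scores for one grade."""
    standardized = [zscores(decoding), zscores(comprehension), zscores(arithmetic)]
    return [mean(child_scores) for child_scores in zip(*standardized)]


# Hypothetical raw scores for three children on the three tests.
composite = academic_composite([10, 20, 30], [4, 6, 8], [5, 10, 15])
```

Because every test is put on the same z-score metric before averaging, the 80-item decoding test does not outweigh the 12-item comprehension test in the composite.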


Analysis Strategy 


We first estimated a stability model (M1, without any cross-lagged
paths; see Figure 1) in which externalizing problems, task
avoidance, and academic performance were predicted by their
preceding values across time. The study variables were allowed to
correlate with each other at each time point. In the second model
(M2), cross-lagged paths from task-avoidant behaviors to externalizing
problems, from academic performance to task-avoidant behaviors,
and from academic performance to externalizing problems
were estimated. In the third model (M3), cross-lagged paths from
task-avoidant behaviors to academic performance, from externalizing
problems to task-avoidant behaviors, and from externalizing
problems to academic performance were estimated. In the fourth
model (M4), all cross-lagged paths were estimated. At each step,
the Satorra-Bentler scaled chi-square difference test was used to
test the improvement of model fit between the competing models.
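For readers unfamiliar with the scaled difference test, the sketch below shows the commonly documented hand calculation for comparing two nested models estimated with a robust (scaled) chi-square. It is an illustration, not the authors' script: the function name is hypothetical, and the scaling correction factors it requires are not reported in this article, so the values in the usage line are invented.

```python
def sb_scaled_difference(t0, c0, d0, t1, c1, d1):
    """Satorra-Bentler scaled chi-square difference test statistic.

    t0, c0, d0: scaled chi-square, scaling correction factor, and degrees of
                freedom of the nested (more constrained) model.
    t1, c1, d1: the same quantities for the comparison (less constrained) model.
    Returns the scaled difference statistic and its degrees of freedom.
    """
    df_diff = d0 - d1
    # Scaling correction for the difference test.
    cd = (d0 * c0 - d1 * c1) / df_diff
    # Multiplying each scaled chi-square by its correction recovers the
    # uncorrected value; the difference is then rescaled by cd.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, df_diff


# Hypothetical inputs: correction factors of 1.2 and 1.1 are assumptions.
statistic, df = sb_scaled_difference(100.0, 1.2, 10, 80.0, 1.1, 8)
```

When both correction factors equal 1, the statistic reduces to the ordinary difference between the two chi-square values, which is why the scaled test values need not equal the simple difference of the reported chi-squares.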

The model construction was continued by testing whether the
cross-lagged paths could be constrained as equal across the grades.
The Satorra-Bentler scaled chi-square difference test was used to
determine whether constraining the cross-lagged paths to be equal across
grades significantly worsened the fit relative to the model in which the paths
were allowed to be freely estimated. As the next step, the statistical
significance of hypothesized mediated effects was examined. Finally,
the multigroup procedure was used to test whether the paths
differ between boys and girls or between children at risk for
reading difficulties and nonrisk control children.

Figure 1. Test of cross-lagged relationships in the study with four models (M1-M4). ACA = academic
performance; TASK = task-avoidant behavior; EXT = externalizing problems.

The analyses were performed with the Mplus statistical package
(Version 7; Muthén & Muthén, 1998-2012) using the standard
missing at random (MAR) approach and full-information
maximum-likelihood estimation with nonnormality-robust standard
errors (MLR; Muthén & Muthén, 1998-2012). The goodness
of fit of the estimated models was evaluated according to the
following four indicators: (a) chi-square test, (b) comparative fit
index (CFI), (c) root-mean-square error of approximation (RMSEA), and
(d) standardized root-mean-square residual (SRMR).
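Of these indices, the RMSEA can be recovered directly from a reported chi-square, its degrees of freedom, and the sample size. The sketch below uses one common textbook form with N - 1 in the denominator; this is an assumption, since some programs use N instead, although the choice does not change the value to two decimals at this sample size. The function name is hypothetical.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of the root-mean-square error of approximation."""
    # Misfit beyond what is expected by chance, floored at zero.
    excess = max(chi_square - df, 0.0)
    return math.sqrt(excess / (df * (n - 1)))


# Values reported in the Results section for the final model.
final_model_rmsea = round(rmsea(49.46, 37, 586), 2)  # 0.02
```

The same function returns 0.12 for the initial stability model, chi-square(46, N = 586) = 451.48, matching the value given in the Results.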

Since the data were hierarchical in nature (i.e., each teacher 
assessed more than one child), we investigated the interdepen- 
dency of the children’s (N = 586) externalizing problems and task 
avoidance within classrooms (N = 155 classrooms). This investi- 
gation was done by means of calculating the intraclass correlations 
(ICC) using the teacher identification number as a clustering 
variable (Heck & Thomas, 2009). The resulting ICCs for externalizing
problems were .07, .09, .08, and .13 (ps = .08, .05,
.07, and .03) in Grades 1-4, respectively. The ICCs for teacher-rated
task avoidance were .05, .04, .12, and .12 (ps = .24, .41,
.01, and .02) in Grades 1-4, respectively. Next, the design effects
were obtained according to the ICCs in Grades 1-4, and they were
1.19, 1.21, 1.18, and 1.27 for externalizing problems and 1.13,
1.08, 1.26, and 1.26 for task avoidance, respectively. Hox and
Maas (2002) suggested that analyzing multilevel data as single-level
data can yield acceptable (not overly biased) parameter
estimates and inferential tests if the design effects are smaller than
2.0. However, in our subsequent analyses, we used the TYPE = COMPLEX
approach (Muthén & Muthén, 1998-2012). This command
estimates the model at the whole sample level but corrects for 
distortions in standard errors in the estimation caused by the 
clustering of observations (i.e., classroom differences). 
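The design effect mentioned above is usually computed from the ICC and the average cluster size. The sketch below uses that standard formula; the function name is hypothetical, and the average cluster size of 586 / 155 children per classroom is an assumption derived from the totals in the text, because the grade-specific cluster sizes are not reported.

```python
def design_effect(icc, average_cluster_size):
    """Design effect: variance inflation due to clustering of observations."""
    return 1.0 + (average_cluster_size - 1.0) * icc


# Assumed average cluster size based on the overall sample (not grade specific).
AVERAGE_CLUSTER_SIZE = 586 / 155

grade1_externalizing = design_effect(0.07, AVERAGE_CLUSTER_SIZE)  # about 1.19
```

With this assumed cluster size, the Grade 1 ICC of .07 gives a design effect of about 1.19, in line with the reported value; the values for later grades depend on the cluster sizes available at those time points.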


Results 




Descriptives and Correlations 


The descriptive statistics are shown in Table 1. The correlations 
of study variables across grades indicated moderate-to-high inter- 
individual stability. For externalizing problems, the correlations 
ranged from .55 to .72, and those for task-avoidant behavior and 
academic performance ranged from .54 to .72 and from .71 to .81, 
respectively. Externalizing problems showed moderate positive
correlations with task-avoidant behavior. Furthermore, both externalizing
problems and task-avoidant behavior showed moderate or
low negative correlations with academic performance. 


Cross-Lagged Relationships Among Externalizing 
Problems, Task-Avoidant Behavior, and Academic 
Performance 


We first estimated the stability model (M1) without any cross-lagged
paths (Figure 1). The goodness-of-fit indices indicated a
poor model fit, χ²(46, N = 586) = 451.48, p < .001; CFI = 0.89,
RMSEA = 0.12, SRMR = 0.16. The modification indices suggested
that the fit would be improved if there was a direct path (a)
from academic performance in Grade 1 to academic performance 
in Grade 4, (b) from academic performance in Grade 2 to academic 
performance in Grade 4, (c) from academic performance in Grade 
1 to academic performance in Grade 3, (d) from task-avoidant 
behavior in Grade 1 to task-avoidant behavior in Grade 3, (e) from 
task-avoidant behavior in Grade 2 to task-avoidant behavior in 
Grade 4, (f) from externalizing problems in Grade 1 to external- 
izing problems in Grade 3, and (g) from externalizing problems in 
Grade 2 to externalizing problems in Grade 4. After adding these
paths, the model fit the data well: χ²(40, N = 586) = 188.61, p <
.001; CFI = 0.96, RMSEA = 0.08, SRMR = 0.11.

We then estimated M2, in which paths from task-avoidant behaviors
to externalizing problems, from academic performance to
task-avoidant behaviors, and from academic performance to externalizing
problems were estimated; M3, in which paths from task-avoidant
behaviors to academic performance, from externalizing
problems to task-avoidant behaviors, and from externalizing problems
to academic performance were estimated; and finally, M4, in
which all cross-lagged paths were estimated (Figure 1). As can be
seen in Table 2, the Satorra-Bentler scaled chi-square difference
test showed that the difference between the stability model (M1)
and M2 was statistically significant, with M2 better accounting for
the data. The Satorra-Bentler scaled chi-square difference test
further showed that the difference between the stability model M1
and M3 was statistically significant, indicating that M3 provided a
better fit with the data. The comparison between stability model
M1 and M4 showed that the difference was statistically significant,
Table 1
Sample Correlation Matrix and Descriptive Statistics of the Study Variables

Variable                                      1      2      3      4      5      6      7      8      9     10     11     12
 1. Externalizing problems (T1, Grade 1)   1.00
 2. Externalizing problems (T2, Grade 2)      …   1.00
 3. Externalizing problems (T3, Grade 3)    .60    .68   1.00
 4. Externalizing problems (T4, Grade 4)      …    .63    .67   1.00
 5. Task avoidance (T1, Grade 1)            .59    .49    .43    .41   1.00
 6. Task avoidance (T2, Grade 2)            .50    .61    .50    .47    .70   1.00
 7. Task avoidance (T3, Grade 3)            .44      …      …      …    .61    .69   1.00
 8. Task avoidance (T4, Grade 4)            .46      …    .51    .62    .54    .60    .72   1.00
 9. Academic performance (T1, Grade 1)        …      …   -.18   -.18   -.45   -.43   -.44   -.36   1.00
10. Academic performance (T2, Grade 2)        …      …      …      …   -.44   -.44   -.43   -.39    .81   1.00
11. Academic performance (T3, Grade 3)        …      …      …      …      …   -.41   -.44   -.40    .71    .79   1.00
12. Academic performance (T4, Grade 4)        …      …      …      …   -.41   -.45   -.47   -.43    .73    .81    .80   1.00
Mean                                       1.49   1.47   1.47   1.39   2.67   2.64   2.65   2.53      …      …      …      …
SD                                         0.47   0.48   0.48   0.40   0.92   0.92   0.94   0.92   0.84   0.84   0.81   0.80
Median                                     1.33   1.33   1.33   1.22   2.60   2.60   2.60      …      …      …      …      …
Min                                        1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00      …  -2.49  -3.14  -3.16
Max                                        2.89   2.89   3.00   3.00   5.00   5.00   5.00   5.00   2.76   2.32   1.60   1.83
25th percentile                               …   1.11      …   1.11   2.00   1.90   1.90      …      …      …      …      …
75th percentile                            1.78   1.78   1.78   1.67      …   3.30   3.30   3.20   0.34   0.33   0.39   0.38
Valid N                                     485    487    474    440    553    546    541    513    586    572    567    543
% missing                                  17.2   16.9   19.1   24.9    5.8    7.0    7.8   12.6      0    2.4    3.2    7.3

Note. All correlations were significant at p < .01 (two-tailed testing of significance). T = Time; Min = minimum; Max = maximum.


indicating that M4 better accounted for the data. The further model
comparisons (M2 vs. M4 and M3 vs. M4) revealed that M4, with all
the hypothesized cross-lagged paths, best fit the data.

Next, the cross-lagged paths were set equal across the grades.
The Satorra-Bentler scaled χ² difference test showed that the
model fit was not significantly decreased if all the cross-lagged
paths were constrained to be equal across Grades 1-4 (p > .05).

The fit of the final model M4 was χ²(37, N = 586) = 49.46, p =
.08; CFI = 1.00, RMSEA = 0.02, SRMR = 0.02. The results of


Table 2
Goodness-of-Fit Statistics (Chi-Square) for the Nested Models

Tested models                        χ²        df    Model comparisons: Satorra-Bentler scaled χ² difference test
1. No cross-lagged paths (M1)        188.610   40
2. Cross paths (M2)                  108.086   31    M1 vs. M2: χ²(9) = 80.371, p < .001
3. Cross paths (M3)                  106.086   31    M1 vs. M3: χ²(9) = 89.963, p < .001
4. All cross paths (M4)               40.226   25    M1 vs. M4: χ²(15) = 151.093, p < .001
                                                     M2 vs. M4: χ²(6) = 72.611, p < .001
                                                     M3 vs. M4: χ²(6) = 61.058, p < .001





this model are presented in Figure 2. High externalizing problems 
in Grades 1, 2, and 3 predicted high task-avoidant behavior in 
Grades 2, 3, and 4. High task-avoidant behavior in Grades 2 and 3, 
in turn, predicted a low academic performance in Grades 3 and 4. 
However, externalizing problems did not directly predict academic 
performance at any grade. In addition, the results showed that low 
academic performance in Grades 1, 2, and 3 predicted high task- 
avoidance in Grades 2, 3, and 4. High task avoidance in Grades 2 
and 3 was associated with higher externalizing problems in Grades 
3 and 4. Academic performance was not directly associated with 
externalizing problems. 

The statistical significance of the hypothesized mediator effects 
was tested. The estimates and standard errors regarding indirect 
effects are presented in Table 3. The results supported the hypoth- 
esized mediator effects. High externalizing problems in Grades 1 
and 2 were linked with low academic performance in Grades 3 and 
4 through increases in task-avoidant behavior in Grades 2 and 3. 
Conversely, low academic performance in Grades 1 and 2 was 
associated with high externalizing problems in Grades 3 and 4 via 
high task avoidance in Grades 2 and 3. 
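An indirect effect of this kind is the product of the two path coefficients that make up the mediated chain. The sketch below computes that product together with a first-order delta-method (Sobel) standard error. It is a generic illustration under stated assumptions: the function name and the path values are hypothetical, the individual path coefficients are not reported in this section, and the software's own standard errors may be computed somewhat differently.

```python
import math


def indirect_effect(a, se_a, b, se_b):
    """Product-of-coefficients indirect effect with a first-order (Sobel) SE.

    a: path from the predictor to the mediator (e.g., EXT at T1 to TASK at T2).
    b: path from the mediator to the outcome (e.g., TASK at T2 to ACA at T3).
    """
    estimate = a * b
    standard_error = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    return estimate, standard_error, estimate / standard_error


# Hypothetical path values, chosen only to show the sign logic:
# a positive first path and a negative second path give a negative indirect effect.
estimate, standard_error, z_value = indirect_effect(0.5, 0.1, -0.4, 0.1)
```

The negative sign of the product mirrors the pattern in Table 3, where higher externalizing problems relate to more task avoidance, which in turn relates to lower academic performance.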


Additional Analyses 


The multigroup method was used to test whether the previous
paths differ between boys and girls. The Satorra-Bentler scaled
chi-square difference tests revealed that the model fit was not
significantly decreased if the main effects among girls and boys
were constrained to be equal (p > .05). Similarly, the multigroup
method was used to test whether the previous paths differ between
children with a risk of reading difficulties and nonrisk children.
The Satorra-Bentler scaled chi-square difference tests revealed
that the model fit was not significantly decreased if the main
effects among at-risk children and controls were constrained as
equal (p > .05). Thus, the multigroup analyses revealed no group
differences, suggesting that the cross-lagged path models were
similar among boys and girls and also among children with and
without risk for reading difficulties.

Figure 2. Results of final model M4: χ²(37)


Discussion 


In this study, we utilized four-wave cross-lagged panel data to
investigate the associations among children’s externalizing problems,
task-avoidant behavior, and academic performance in the early
school years. The results supported a mediation model in which
high externalizing problems in Grades 1 and 2 were linked with
low academic performance in Grades 3 and 4 through increases in
task-avoidant behavior in Grades 2 and 3. The results also provided
evidence for a reversed mediator model: Low academic
performance in Grades 1 and 2 was associated with high externalizing
problems in Grades 3 and 4 via high task avoidance in
Grades 2 and 3. The results were similar for both genders.

Table 3
Unstandardized Estimates of Indirect Effects: Task-Avoidant
Behavior as a Mediator (N = 586)

Indirect effect                                                                      Estimate (SE)
From externalizing problems via task-avoidant behavior to academic performance
  Externalizing problems (T1) → Task avoidance (T2) → Academic performance (T3)      -0.026 (0.006)***
  Externalizing problems (T2) → Task avoidance (T3) → Academic performance (T4)      -0.026 (0.006)***
From academic performance via task-avoidant behavior to externalizing problems
  Academic performance (T1) → Task avoidance (T2) → Externalizing problems (T3)      -0.006 (0.001)***
  Academic performance (T2) → Task avoidance (T3) → Externalizing problems (T4)      -0.006 (0.001)***

Note. T = time.
*** p < .001 (two-tailed testing of significance).

The first aim of this study was to investigate the mechanisms 
through which externalizing problems may impact children’s 
achievement. Instead of direct associations between externalizing 
problems and academic performance (Hypothesis 1a), we found
evidence for an indirect mechanism through which externalizing
problems set the stage for increased task-avoidant behavior in
learning situations, which leads to lower academic performance
(Hypothesis 1b). The indirect linkages were observed beyond
autocorrelation, and they were strikingly consistent across different
time points of the longitudinal study. These findings are
particularly important because our focus was on children who had
just entered primary school. This period is critical in the development
of basic academic skills and achievement-related strategies
that children use to achieve learning goals (Mägi, Lerkkanen,
Poikkeus, Rasku-Puttonen, & Kikas, 2010).

There are at least two mechanisms that may be responsible for 
the mediating role of task-avoidant behavior on the relation be- 
tween externalizing problems and children’s academic perfor- 
mance. First, externalizing problems are likely to increase conflic- 
tual interactions and negative feedback received from teachers to 
the child (Henricksson & Rydell, 2004; Ladd et al., 1999; Stipek 
& Miles, 2008), which, in turn, may generate low competence 
beliefs and failure expectations, and finally low inclination to exert 
effort needed for success in academic work (Nurmi et al., 2003). 
Second, deficits in attention and self-regulation skills, which are
typical of externalizing problems (Olson et al., 2005, 2011), are likely
to make it difficult for the child to stay focused and finish tasks, 
which hampers his or her learning and academic performance. 

As the second goal of the study, we investigated the association 
between low academic performance and subsequent externalizing 
problems and the extent to which this association was mediated by 
high task-avoidant behavior. Again, academic performance was 
not directly linked with externalizing problems (Hypothesis 2a). 
Instead, the results provided consistent support for the reversed 
mediation model (Hypothesis 2b). Children who showed low academic
performance in reading and math in Grades 1 and 2 started
to avoid learning tasks in Grades 2 and 3, and eventually had 
higher rates of externalizing problems. These results are in line 
with the findings of McGee et al. (1986) indicating reciprocal 
negative linkages between poor reading skills and externalizing 
problems. Our findings complement this literature by showing
that the effects of academic skills on externalizing problems
are indirect, running through task-avoidant behavior. One possible 
explanation for the findings is that poor academic performance 
predisposes children to failure expectations that lead to task avoid- 
ance and off-task behavior (Aunola et al., 2002; Lepola et al., 
2000; Onatsu-Arvilommi & Nurmi, 2000; Pakarinen et al., 2011). 
These are likely to cause conflicts with both parents and teachers, 
which, in turn, exacerbate children’s oppositional and acting-out 
behavior (Arnold, 1997). 

Prior research is scarce on the potential mechanisms linking 
externalizing problems and academic achievement. As an excep- 
tion, a recent study by Zimmermann et al. (2013) showed that 
externalizing problems and low academic achievement recipro- 
cally affected each other from middle childhood to adolescence 
and that low self-esteem partially accounted for this association. 
The present study extends this line of research and provides new 
insights into our understanding of the processes by which academic
performance and externalizing problems are linked over time. The
multiwave cross-lagged panel data allowed investigation of indirect
effects via task-avoidant behavior and made it possible to
examine how behavioral regulation, achievement strategies, and
academic learning shape each other over time. The findings suggest
a cyclical nature of relationships between the studied variables,
which is likely to lead to the strengthening of the difficulties
over time. Task-avoidant behavior played a key role in the forma- 
tion of these reciprocal associations. One explanation for the key 
role of task-avoidant behavior is that it reflects children’s motiva- 
tional and behavioral ways to approach learning tasks: it is an 
aspect of motivational orientation that is most closely related to 
learning in the classroom and homework situations (e.g., Morgan 
et al., 2008). Nevertheless, the dynamics between externalizing 
problems and academic achievement are also likely to include 
additional components, for example, quality of instruction and the 
nature of the relationship between children and their teachers 
(Liew, Chen, & Hughes, 2010). These developmental dynamics 
should be examined in future studies in greater detail. 

When interpreting the findings of this study, the following 
limitations should be considered. The first limitation concerns the 
data that were drawn from a follow-up sample with a special 
interest in the risk and protective factors affecting academic skill 
development and motivation. Consequently, children with an early 
risk for reading problems were overrepresented in the sample. This
limitation was addressed by testing the significance of the
cross-lagged paths separately for the children with and without risk
for reading difficulties. The findings showed that the processes 
linking externalizing problems, task-avoidant behavior, and aca- 
demic performance were similar irrespective of kindergarten-age 
risk for reading difficulties. Similarly, the potential gender differ- 
ences were examined by testing the paths for girls and boys 
separately. These analyses demonstrated that the statistically sig- 
nificant paths did not vary systematically by gender. Our findings 
corroborate those of the previous studies showing that the associ- 
ations of externalizing problems and various academic measures 
are similar for boys and girls (Burt & Roisman, 2010; Nelson et al., 
2004). 

Second, the current study was based on correlational data, and 
caution should be exercised when interpreting the findings. The 
study does not provide a definite answer to the question of cau- 
sality. However, the four-wave cross-lagged panel data allow 
somewhat stronger inferences about the direction of effects than 
some earlier cross-sectional studies (Arnold, 1997). Finally, the 
coefficients for the indirect paths linking externalizing problems to 
academic achievement through task-avoidant behavior were small
in magnitude. This limitation should, however, be viewed against 
the moderate-to-high stability and within-time co-variation be- 
tween our constructs, which leave relatively little variance unex- 
plained. Moreover, the effect sizes of the indirect links have also 
been low in magnitude in many previous studies (e.g., Bornstein et 
al., 2010; van Lier et al., 2012). 

Our findings suggest a need for educational interventions con- 
cerning externalizing behaviors that interfere with children’s abil- 
ity to focus and persist in tasks (for a meta-analysis on school- 
based interventions, see Wilson & Lipsey, 2007). When 
successful, such interventions would help children to develop more 
adaptive learning strategies and possibly prevent the development 
of later mental health problems and antisocial behavior that often 
follow from early disruptive behaviors (Caspi, 2000; Fergusson et 
al., 2007). The results of the present study suggest that interven- 
tions that aim to reduce task-avoidant behavior by improving 
classroom quality and teacher-student relationships may also
prove useful. Such an intervention would have ramifications both 
for the development of externalizing problems and academic per- 
formance. Previous findings indicate that the higher the quality of 
teachers’ instructional support in the classroom, the less task-avoidant
behavior children show (Pakarinen et al., 2011). Teachers can
help children to face difficult learning situations by giving indi- 
vidualized feedback and providing tasks that are optimally chal- 
lenging. They can also encourage children’s efforts and thereby 
decrease their anxiety in learning situations, which easily leads to task
avoidance. In addition, children’s motivation is likely to be fos- 
tered by classroom goal orientation, which emphasizes mastery, 
understanding, and improving skills and knowledge rather than 
demonstrating high ability or competing for grades (for a review, 
see Meece, Anderman, & Anderman, 2006). 

In addition, interventions targeted specifically to children with 
maladaptive achievement strategies should also be considered. 
One central mechanism underlying task-avoidant behavior is the 
child’s negative self-concept and failure expectations. Earlier find- 
ings indicate that by fostering children’s self-efficacy beliefs, for 
instance, by informing students of their capabilities and progress in 
learning and providing learning strategies that help them succeed,
it is possible to support children’s task persistence and engagement
in learning, and eventually increase their academic knowledge and 
skills (Schunk & Pajares, 2001). Any prevention or intervention 
model that is aimed at supporting children’s school work should 
target multiple domains.


In sum, this study represents a unique effort to investigate the 
interrelationships among externalizing problems, task-avoidant be- 
havior, and academic performance, and how they shape each other 
in early school years. The strength of the study lies in the pro- 
spective longitudinal design with multiple assessments over time 
that allowed for testing the mediating effect of task-avoidant 
behavior in the cross-lagged associations between externalizing 
behavior and academic performance. In future research, investiga- 
tors should further examine the developmental trajectories of ac- 
ademic performance and externalizing problems across time and 
how a more varied set of motivational factors may help to explain 
the formation of such trajectories.

Self-concept, dating back to at least the seminal work by William
James (1890/1963), is one of the oldest and most important constructs 
in the social sciences. Today positive self-beliefs are also at the heart 
of a positive revolution sweeping psychology, which emphasizes 
focusing on how healthy, normal, and exceptional individuals can get
the most from life (e.g., Bandura, 2006; Bruner, 1996; Diener, 2000;
Marsh & Craven, 2006; Seligman & Csikszentmihalyi, 2000). Thus, 
self-concept enhancement is now a major goal in many fields includ- 
ing education, child development, health, sport/exercise sciences, 
social services, and management (Marsh, 2007). Self-concept is also 
an important mediating factor that facilitates the attainment of other 
desirable outcomes. Particularly in education settings, a positive ac- 
ademic self-concept (ASC) is both a highly desirable goal and a 
means of facilitating subsequent learning and other academic accom- 
plishments. Our study is based on the 2007 Trends in International 
Mathematics and Science Study (TIMSS), with nationally represen- 
tative samples from different countries and age cohorts, to provide 
tests of the developmental and cross-sectional generalizability of 
strong theoretical models of self-concept using new and evolving 
statistical methodology. 


Big-Fish-Little-Pond Effect (BFLPE): 
The Theoretical and Substantive Focus 


BFLPE Theoretical Models 


Self-concept theory emphasizes that perceptions of the self 
cannot be adequately understood if frames of reference are ignored
(Marsh, 2007). The same objective characteristics and accomplishments
can lead to disparate self-concepts, depending on the frames
of reference or standards of comparison that individuals use to 
evaluate themselves, and these self-beliefs have important impli- 
cations for future choices, performance, and behaviors. From the 
time of William James (1890/1963), psychologists have recognized
that objective accomplishments are evaluated in relation to
frames of reference. Thus, James indicated, “we have the paradox 
of a man shamed to death because he is only the second pugilist or 
the second oarsman in the world” (p. 310). Marsh (1984; see also 
Marsh & Parker, 1984; Marsh, Seaton, et al., 2008) proposed the 
BFLPE to encapsulate frame of reference effects that are based on 
an integration of theoretical models and empirical research from 
diverse disciplines (e.g., relative deprivation theory, social com- 
parison theory, psychophysical judgment, social judgment; see 
supplemental materials). 

In the BFLPE model, students are hypothesized to compare their 
abilities with the abilities of their classmates and use this social 
comparison impression as one basis for forming their own self-concept
(see Figure 1). Individual ability is hypothesized to be positively related to
ASC (the brighter I am, the higher my ASC), whereas class- and
school-average ability are hypothesized to have a negative effect on ASC (the brighter
my classmates, the lower my ASC). Hence, ASC depends not only 
on a student’s academic accomplishments but also on those of the 
student’s classmates. Consistent with theoretical predictions and 
an increasing emphasis on the multidimensionality of self-concept, 
the BFLPE in academic settings is specific to ASC; class- and 
school-average achievement has little positive or negative
effect on nonacademic components of self-concept or on global 
self-esteem (e.g., Marsh, 1987; Marsh, Chessor, Craven, & Roche, 
1995; Marsh & Parker, 1984; for a review, see Marsh, Seaton, et 
al., 2008). 
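
The frame-of-reference prediction described above can be illustrated numerically. The following is only a minimal sketch on simulated data, not the multilevel structural equation models used in this study: it regresses ASC on individual achievement and on class-average achievement with ordinary least squares. The function names (`class_average`, `simulate_classes`, `bflpe_coefficients`) and the generating effects (+0.5 and -0.3) are hypothetical choices made purely for illustration.

```python
import numpy as np


def class_average(achievement, class_ids):
    """Return, for every student, the mean achievement of his or her class."""
    _, inverse = np.unique(class_ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.bincount(inverse, weights=achievement)
    counts = np.bincount(inverse)
    return (sums / counts)[inverse]


def simulate_classes(n_classes=200, class_size=25, seed=0):
    """Simulate hypothetical students nested in classes (illustrative values only)."""
    rng = np.random.default_rng(seed)
    class_ids = np.repeat(np.arange(n_classes), class_size)
    # Classes differ in average ability; students vary around their class level.
    class_ability = rng.normal(0.0, 1.0, n_classes)
    achievement = class_ability[class_ids] + rng.normal(0.0, 1.0, class_ids.size)
    class_mean = class_average(achievement, class_ids)
    # Assumed generating model: positive individual effect (+0.5) and
    # negative class-average effect (-0.3), plus random noise.
    noise = rng.normal(0.0, 1.0, class_ids.size)
    asc = 0.5 * achievement - 0.3 * class_mean + noise
    return achievement, asc, class_ids


def bflpe_coefficients(achievement, asc, class_ids):
    """OLS regression of ASC on individual and class-average achievement."""
    achievement = np.asarray(achievement, dtype=float)
    class_mean = class_average(achievement, np.asarray(class_ids))
    design = np.column_stack([np.ones_like(achievement), achievement, class_mean])
    beta, *_ = np.linalg.lstsq(design, np.asarray(asc, dtype=float), rcond=None)
    return {"individual": beta[1], "class_average": beta[2]}


if __name__ == "__main__":
    ach, asc, ids = simulate_classes()
    print(bflpe_coefficients(ach, asc, ids))
```

In this sketch the coefficient for class-average achievement, estimated while controlling for individual achievement, plays the role of the BFLPE: a negative value means that equally able students report lower ASC in higher achieving classes. A single-level regression of this kind ignores sampling weights, measurement error, and the clustering of students, all of which the analyses reported here take into account.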

Diener and Fujita (1997, p. 350) reviewed BFLPE research in 
relation to the broader social comparison theory and concluded 
that BFLPE research provided the clearest support for predictions 
based on social comparison theory in an imposed social compar- 
ison paradigm. The reason for this, they surmised, was that the 
frame of reference, based on classmates within the same class or 
school, is more clearly defined in BFLPE research than in most 
other research settings. The class or school setting matters because the
social comparisons made there are much more ecologically valid than manipulations
in typical social psychology experiments involving introductory
psychology students in contrived settings. Indeed, they argued that
except for opting out altogether, it is difficult for students to avoid
the relevance of achievement as a reference point within a class or
school setting or the social comparisons provided by the academic
accomplishments of their classmates (see also Marsh, 2007). Seaton,
Marsh, et al. (2008) provided a theoretical rationale for how
the BFLPE fits with the broader social comparison research literature,
contrasting results for imposed social comparisons with those for
social comparisons in which students can freely choose their comparison
targets. In support of the direct role of social comparison for
the BFLPE, Huguet et al. (2009) demonstrated that the BFLPE was
largely eliminated after controlling for pure measures of social comparison.

Figure 1. Conceptual model of the big-fish-little-pond effect (BFLPE).

Extensive support for the BFLPE generalizes over student groups, 
subject domains, ASC instruments, and cultures (see reviews by 
Marsh, Seaton, et al., 2008; Seaton & Marsh, 2013). However, most 
BFLPE research has been based on high school students from West- 
ern developed countries (e.g., Australia, United States, Germany, 
Israel, France, the Netherlands, and the United Kingdom), as well as 
Asian countries such as Hong Kong and Singapore.


Cross-Cultural Support for the BFLPE 


In cross-cultural research there are two main orientations, one 
that focuses on tests of a priori hypotheses of cross-cultural dif- 
ferences and one that tests the replicability of existing theories in 
other cultures and seeks universal, panhuman theories (e.g., Parker 
et al., 2012; Segall, Lonner, & Berry, 1998, p. 1102). However, 
strong cross-cultural studies need to compare the results from at 
least two—and preferably many—countries based on comparable 
samples and the same measures; otherwise apparent cross-cultural 
differences are confounded with potential differences in the com- 
position of samples and, perhaps, the appropriateness of materials. 
Addressing these challenges, there is strong support for the cross- 
cultural generalizability of the BFLPE for high school students, 
based on successive data collections of the Organisation for Eco- 
nomic Co-operation and Development Program for International 
Student Assessment (PISA) data: Marsh and Hau (2003) used the 
PISA 2000 data based on 103,558 fifteen-year-old students from 
26 predominantly industrialized Western countries; Seaton, Marsh, 
and Craven (2009, 2010) used PISA 2003 (265,180 students, 
10,221 schools, 41 countries), which included more collectivist 
and developing economies than PISA 2000; Nagengast and Marsh 
(2012) used the PISA 2006 database in the largest cross-cultural 
study of the BFLPE undertaken to date, and significantly extended 
the previous PISA studies. In summarizing the three BFLPE-PISA
studies, Nagengast and Marsh reported that the effect of school- 
average achievement was negative in all but one of the 123 
samples across the three studies, and significantly so in 114 samples.
The average effect size across all 123 samples is -.223 (see
detailed summary of this previous research in the supplemental 
materials, Table S3). 

The overarching cross-cultural focus of our study was to evaluate
the generalizability of the BFLPE by comparing results from Middle Eastern Islamic
countries (where there has been almost no research) with those found
in Western countries (which are the basis of most research) and Asian
countries (which have been the basis of some research). Although we
note why these comparisons are potentially interesting, it was our
expectation that there would be reasonable support for the generalizability
of the results. Indeed, Seaton et al. (2009) claimed
support for the universality of the BFLPE as a panhuman theory 
based on PISA data. However, Schwartz and Bilsky (1990), as 
well as many others, observed, “Theories that aspire to universality
. . . must be tested in numerous, culturally diverse samples”
(p. 878). In this respect, one purpose of our study is to greatly 
expand the scope of tests of the BFLPE’s generalizability beyond 
the set of PISA studies that have been the primary basis of 
cross-cultural tests of its universality. 

PISA versus TIMSS. Although each of the successive PISA 
studies included a larger and more diverse sample of countries, 
there were important limitations that are the focus of the present 
investigation. For example, in their monograph on concerns related 
to PISA, Hopmann, Brinek, and Retzl (2007; see also Ertl, 2006) 
summarized a range of substantive, methodological, and policy- 
related concerns. These included the inappropriateness of the PISA 
model in respect of what is actually taught in many school systems, 
technical issues related to translation and scaling, problems with 
the sampling design, PISA’s focus on literacy in testing mathe- 
matics, issues in relation to gender, and the league table ranking of 
countries based on PISA results. Potential concerns such as these 
dictate that cross-cultural BFLPE research based almost exclu- 
sively on PISA data should be cross-validated with data from 
different sources. Hence, it is surprising that there is apparently no 
cross-cultural BFLPE research based on the TIMSS data. 

Systematic comparisons of results based on the TIMSS 
and PISA studies (e.g., American Institutes for Research, 2005; 
Hutchison & Schagen, 2007; National Center for Education Sta- 
tistics, 2008; Neidorf, Binkley, Gattis, & Nohara, 2006; Wu, 2009) 
emphasize many similarities but also important differences be- 
tween achievement tests in the two databases. In particular, PISA 
focuses more on the application of knowledge to “real life” prob- 
lems, while TIMSS focuses on achievement more closely linked to 
school curriculums. Wu (2009) reported that these differences in 
item content explain in part why Western countries tended to 
perform better on PISA than TIMSS, while Eastern European and 
Asian countries tended to perform better on TIMSS than PISA. 

A key distinction between TIMSS and PISA that is of particular 
relevance to the BFLPE lies in the differences in the way data have
been collected. PISA samples schools, rather than classrooms, and 
then tests a random sample of 15-year-olds from each school, so 
that participants within the same school typically come from two-, 
three-, or four-year cohorts. Even at the school level, this sampling 
design complicates interpretation of frame-of-reference effects 
based on school-average achievement, which typically does not 
correspond to the achievement levels of students in any of the 
different year cohorts actually considered. In contrast, although 
TIMSS also samples schools, it measures all students from se- 
lected intact classrooms. However, although TIMSS is nationally 
representative of each country in relation to classes, there is 
typically only a single class selected from each school and this 
class may or may not be representative of the school as a whole. 
Hence, the appropriate unit of analysis with TIMSS is the class- 
room rather than the school. 

Although the focus of both TIMSS and PISA has been on 
achievement scores, both databases include a range of psychoso- 
cial variables, including ASC responses that are the focus of the 
present investigation. For both PISA and TIMSS, considerable
effort has been made to make the achievement scores comparable
from one data collection to the next, although the rationale for 
testing achievement differs substantially in PISA and TIMSS. 
Whereas there has also been reasonable consistency in the items 
used to infer mathematics self-concept in the different TIMSS data 
collections (see discussion by Marsh et al., 2013; see also Method 
section for wording of TIMSS math self-concept items used here), 
this has not been the case for PISA. First of all, PISA typically 
only includes math self-concept responses when mathematics is 
the focus of the PISA data collection. Furthermore, the number and 
wording of items used in PISA to assess self-concept in the focal 
domain (math, science, or reading) varies substantially from one 
data collection to the next (for wording of PISA items in different 
data collections, see PISA website http://www.oecd.org/pisa/aboutpisa/).
Finally, on the basis of these comparisons, it is also
evident that the items used to assess self-concept differ substantially
across the TIMSS and PISA studies.

Local dominance effect. According to the local dominance 
effect (Zell & Alicke, 2009; see also Liem, Marsh, Martin, McInerney,
& Yeung, 2013), the frame of reference (school versus
classroom) is a potentially important consideration for BFLPE
research. Zell and Alicke (2009; see also Alicke, Zell, & Bloom, 
2010) provided support for the BFLPE by experimentally manip- 
ulating the frame of reference in relation to feedback given to 
participants about how their performances compared to others. 
When they pitted “local” against more “general” comparison stan- 
dards, participants consistently used the most local comparison 
information available to them, even when they were told that the 
local comparison was not representative of the broader population 
and were provided with more appropriate normative comparison 
data. Hence, class-average achievement based on the TIMSS
data constitutes a more proximal frame of reference than
school-average achievement based on PISA data and is therefore
likely to be more locally dominant. In this respect, it is important
to note that our study is apparently the first cross-cultural BFLPE 
study to be based on the classroom as the unit of analysis, rather 
than the school. 


Developmental Support for the Generalizability of 
the BFLPE 


For many developmental, educational, and psychological re- 
searchers, self-concepts are a “cornerstone of both social and 
emotional development” (Kagan, Moore, & Bredekamp, 1995,
p. 18; see also Davis-Kean & Sandler, 2001; Marsh, Ellis, & 
Craven, 2002); self-concepts develop early in childhood and, once
established, they are enduring (e.g., Eder & Mangelsdorf, 1997). 
The development of self-concept is therefore emphasized in many 
early childhood programs (e.g., Fantuzzo, McDermott, Manz, 
Hampton, & Burdick, 1996). 

Hattie (1992; Hattie & Marsh, 1996; see also Eccles, Wigfield, 
Harold, & Blumenfeld, 1993; Harter, 1999, 2006, 2012; Marsh, 
Craven, & Debus, 1998) reviewed theoretical and empirical sup- 
port for stages of growth in the development of self-concept, 
arguing against the notion of fixed stages that all persons must pass 
through. Instead, he posited seven parallel developments that are 
relevant to self-concept formation: (a) children distinguish self and 
others, (b) children distinguish self and the environment, (c) 
changes in major reference groups lead to changes in expectations,
(d) attributions are made to salient personal and social or external
sources, (e) cognitive processing capacities develop, (f) children
develop particular cultural values, and (g) children develop strat- 
egies for confirmation and disconfirmation of self-referent infor- 
mation. Thus, with age and development, young children increas- 
ingly integrate information from their immediate environment into 
their self-concept formation. This is particularly relevant to the 
present investigation, emphasizing the integration of external 
frames of reference and social comparison into self-concept for- 
mation. 

Indeed, many authors (Chapman & Tunmer, 1995; Eccles et al., 
1993; Harter, 1999; Marsh, 1989; Marsh & Craven, 1997; Skaalvik 
& Hagtvet, 1990; Wigfield & Eccles, 1992; Wigfield et al., 1997) 
have offered a developmental perspective on the relation between 
ASC and academic achievement. For example, Marsh (1989, 1990; 
Marsh et al., 1998) proposed that the self-concepts of very young 
children are very positive and are not highly correlated with 
external indicators (e.g., skills, accomplishments, achievement, 
self-concepts inferred by significant others) but that with increas- 
ing life experience, children learn their relative strengths and 
weaknesses, so that specific self-concept domains become more 
differentiated and more highly correlated with external indicators. 
Marsh et al. (1998) showed that reliability, stability, and factor 
structure of self-concept scales improve with age (children 5-8 
years of age). In addition, consistent with the proposal that chil- 
dren’s self-perceptions become more realistic with age, self-ratings 
of older children were more correlated with inferred self-concept 
ratings by their teachers. 

In summary, there is good developmental theory for the predic- 
tion that with age and development ASC becomes more closely 
related to external criteria, including academic achievement and 
perceptions of others. From this, it is reasonable to speculate that 
the BFLPE would also become stronger with age and develop- 
ment, but there is little or no empirical evidence against which to 
evaluate this supposition. Testing the generalizability of the
BFLPE over primary and secondary students is the major focus of 
our study. 


TIMSS 2007: Background to the Present Investigation 


An important aspect of TIMSS is collection of data from two 
age cohorts (corresponding to fourth and eighth grades), pro- 
viding a unique developmental perspective on cross-cultural 
studies of the BFLPE. Included in the present investigation are 
fourth- and eighth-grade classes (see Table 1 for more detail) in 
six Western countries (Australia, England, Italy, Norway, Scot- 
land, and United States), four Asian countries (Hong Kong, 
Japan, Singapore, and Taiwan), and three Middle Eastern Islamic
countries (Iran, Kuwait, and Tunisia) where both mathematics
and science were taught as an integrated subject, and
where data were available for both fourth- and eighth-grade
cohorts.

In line with the substantial body of BFLPE research, we predict that
there will be good support for the developmental generalizability
of the BFLPE across the two matched age cohorts, and for
the cross-cultural generalizability of the BFLPE across the 13
countries. In keeping with developmental models of self-concept
(and limited empirical support), we posit that relations
between ASC and achievement will be significant for all 26 (2
age cohorts × 13 countries) groups but will be stronger for
the older age cohort. Of central importance, we predict that the
negative effect of class-average achievement on ASC (the
BFLPE) will also be significant across all 26 groups. However,
we further surmise that the BFLPE will be stronger for the older
cohort, in that developmental models posit that social comparison
and normative processes in the formation of self-concept
grow stronger over this developmental period; but we recognize
that there is limited empirical support for this prediction. We
leave as a research question whether there are substantively
meaningful interactions between country and age cohort differences;
alternatively, the extent to which age cohort effects
generalize across countries.


Table 1

Reliability of the Trends in International Mathematics and
Science Study Math and Science Motivation Constructs Used in
This Study

                                     Number of
Country           Cohort   Students   Classes   Schools   Reliability

Western countries
  Australia          4        4,108       316       228   [illegible]
                     8        4,119       238       227   .809
  England            4        4,316       233       142   [illegible]
                     8        4,046       238       136   .795
  Italy              4        4,470       323       169   .687
                     8        4,408       287       169   .841
  Norway             4        4,108       290       144   .677
                     8        4,748       264       138   .805
  Scotland           4        3,929       252       138   [illegible]
                     8        4,213       244       128   .770
  United States      4        7,896       515       256   .763
                     8        7,636       510       238   .838
  Total              4       28,827     1,929     1,077   [illegible]
                     8       29,170     1,781     1,036   .810
Asian countries
  Taiwan             4        4,131       174       149   [illegible]
                     8        4,046       153       149   .838
  Hong Kong          4        3,791       147       125   [illegible]
                     8        3,527       120       119   .803
  Japan              4        4,487       189       147   .762
                     8        5,625       169       145   [illegible]
  Singapore          4        5,041       354       176   [illegible]
                     8        4,754       326       163   .825
  Total              4       17,450       864       597   [illegible]
                     8       17,952       768       576   .811
Middle Eastern Islamic countries
  Iran               4        3,833       224       223   .734
                     8        3,981       208       207   .744
  Kuwait             4        3,803       181       149   .351
                     8        4,091       158       157   .589
  Tunisia            4        4,134       217       149   .450
                     8        4,080       169       149   .729
  Total              4       11,770       622       521   .512
                     8       12,152       535       513   .687
All countries
  Grade 4                    58,047     3,415     2,195   .681
  Grade 8                    59,274     3,084     2,125   .781
  Total                     117,321     6,499     4,320   .731

Note. Reliabilities are expressed as Cronbach’s alpha estimates.
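
The reliabilities reported in Table 1 are Cronbach's alpha estimates. As a minimal sketch of how such a coefficient is computed from item responses, the following function applies the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and the small response matrix in the usage example are hypothetical, and the sketch assumes that negatively worded items have already been reverse-scored and that there are no missing responses.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) matrix of scored responses."""
    items = np.asarray(items, dtype=float)
    if items.ndim != 2 or items.shape[1] < 2:
        raise ValueError("need a 2-D array with at least two items")
    n_items = items.shape[1]
    # Sample variance of each item and of the summed scale score.
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)


if __name__ == "__main__":
    # Hypothetical responses of five students to four Likert items (1-4).
    responses = [
        [4, 4, 3, 4],
        [3, 3, 3, 2],
        [2, 1, 2, 2],
        [4, 3, 4, 4],
        [1, 2, 1, 1],
    ]
    print(round(cronbach_alpha(responses), 3))
```

Because alpha depends on the number of items as well as on their intercorrelations, scales of only four items, such as the one used here, tend to yield lower values than longer scales with comparable item quality.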


262 MARSH ET AL. 


Method 


Participants 


TIMSS 2007 (Olson, Martin, & Mullis, 2008) assessed the 
competencies in mathematics for nationally representative samples 
of students from participating countries (for more details about the 
processes underlying the development of the TIMSS 2007 instru- 
ments, translation of materials, sampling, data collection, scaling, 
and data analysis, see the TIMSS 2007 technical report by Olson 
et al., 2008). The basic design is a two-stage cluster design that 
consists of sampling schools and intact classrooms from the target 
grade in the school. Included in the present investigation are 
117,321 students from 6,499 fourth- and eighth-grade classes 
representing 13 countries (see Table 1 for more detail). In all 
countries, the materials were administered near the end of the 
school year (typically October or November in the Southern Hemisphere
and April to June in the Northern Hemisphere). For purposes
of convenience and consistency with TIMSS 2007, we refer
to the fourth-grade cohort (typically 9-11 years of age with 4 years 
of formal schooling) as primary school children and the eighth- 
grade cohort (typically 13-15 years of age with 8 years of formal 
schooling) as secondary school adolescents, but realize that this 
terminology is not completely consistent across all countries and 
school systems. 

TIMSS (Olson et al., 2008) used item response theory to scale 
student achievement scores based on a mixture of constructed 
response and multiple-choice items representing algebra, data and 
chance, number, and geometry for eighth-grade students and num- 
ber, geometric shapes and measures, and data display for fourth- 
grade students. Students in both age cohorts responded to items 
designed to measure math self-concept (MSC) on the same classic 
Likert (agree-disagree) response scale; two of the self-concept
items had the same wording in the two age cohorts, but there were 
minor wording changes for the other two self-concept items (see 
supplemental materials): “I usually do well in math”; “Math is 
harder for me than for many of my classmates”; “I am just not 
good at science”; “I learn things quickly in math/science.” 


Data Analysis 


All analyses, conducted with Mplus 7.0 (Muthén & Muthén, 
2013), consisted of multilevel confirmatory factor analyses
(CFAs) and structural equation models (SEMs) based on the 
Mplus robust maximum likelihood estimator, with standard errors 
and tests of fit that were robust in relation to nonnormality of 
observations and the use of Likert responses where there were at 
least four response categories, particularly where nonnormality
was not excessive (e.g., Beauducel & Herzberg, 2006;
DiStefano, 2002; Dolan, 1994; Muthén & Kaplan, 1985). Maxi- 
mum likelihood estimation is also robust to the nonindependence 
of the observations when used in conjunction with a design-based 
correction (Mplus’s complex design option; Muthén & Muthén, 
2013). All analyses were based on TIMSS’s HOUWGT weighting 
variable that incorporates three components related to sampling of 
the school, class, and student, respectively, and three associated 
with nonparticipation at the level of the school, class, and student. 
For present purposes, the 26 (13 countries × 2 age cohorts) groups 
were treated as grouping variables that were the basis of the 


multigroup analyses, whereas the class and school identification 
variable was treated as a clustering variable to control for the 
clustered sample (using the complex design option and robust 
maximum likelihood options in Mplus; Muthén & Muthén, 2013). 

In the TIMSS 2007 database, the achievement tests for each 
student are reported as five plausible values (Olson et al., 2008)— 
numbers drawn randomly from the distribution of scores that could 
be reasonably assigned to each student. Although the amount of 
missing data was small (an average of less than 2% per item), we 
used full-information maximum likelihood estimation to control 
for missing data, noting that this had been done separately for each 
of the five data sets based on different plausible values and then 
combined using the Rubin (1987; Schafer, 1997) strategy to obtain 
unbiased parameter estimates, standard errors, and goodness-of-fit 
statistics. 
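
The combination step can be illustrated with a minimal Python sketch that pools a single parameter across the five plausible-value data sets using Rubin's (1987) rules: the pooled estimate is the mean of the five estimates, and the pooled variance adds the average within-data-set variance to the between-data-set variance inflated by (1 + 1/m). The function name `pool_rubin` and the numeric inputs are hypothetical and serve only as an illustration; this is not the code used for the analyses reported here.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m analyses with Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two data sets")
    pooled = mean(estimates)
    # Average sampling variance within each data set.
    within = mean([se ** 2 for se in std_errors])
    # Variance of the estimates between data sets.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical estimates from five plausible-value data sets.
est, se = pool_rubin(
    [0.59, 0.60, 0.58, 0.59, 0.60],
    [0.020, 0.021, 0.020, 0.019, 0.020],
)
print(round(est, 3), round(se, 3))
```

When the five estimates are identical, the between-data-set term vanishes and the pooled standard error reduces to the within-data-set standard error.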

Comparison of results across different countries, and age cohorts, 
requires strong assumptions about the invariance of the factor struc- 
ture. If the underlying factors differ fundamentally in different groups, 
then there is no basis for interpreting observed differences (the “ap- 
ples and oranges” problem; see Millsap, 2011). Here we initially 
consider invariance across the 26 (13 countries × 2 age cohorts) 
groups. In applied SEM research—particularly for large sample sizes 
as in TIMSS—there is a predominant focus on indices that are sample 
size independent (e.g., Hu & Bentler, 1999; Marsh, Balla, & McDon- 
ald, 1988; Marsh, Hau, & Grayson, 2005; Marsh, Hau, & Wen, 2004). 
The Tucker-Lewis index (TLI) and the comparative fit index (CFI) 
vary along a 0-1 continuum, and values greater than .90 and .95 
typically reflect acceptable and excellent fit to the data, respectively. 
For the root-mean-square error of approximation (RMSEA), values of 
less than .05 and .08 reflect a close fit and a minimally acceptable fit 
to the data, respectively. However, when comparing the relative fit 
of models imposing more or fewer invariance constraints, 
Cheung and Rensvold (2002) and Chen (2007) 
suggest that if the decrease in fit for the more parsimonious model is 
less than .01 for incremental fit indices like the CFI, then there is 
reasonable support for the more parsimonious model. Chen suggests 
that when the RMSEA increases by less than .015, there is support for 
the more constrained model. However, these guidelines are based on 
simulated data studies and practice typically involving only two or a 
small number of groups, and might not be fully applicable to studies 
like the present investigation, based on 26 groups. Hence, it is im- 
portant to emphasize that these are only rough guidelines (Marsh, 
Hau, & Wen, 2004), and it is recommended that applied researchers 
use an eclectic approach based on subjective integration of a variety 
of different indices—including the chi-square, detailed evaluations of 
parameter estimates in relation to theory, a priori predictions, common 
sense, and alternative models specifically designed to evaluate good- 
ness of fit in relation to key issues. This is consistent with the 
approach we used here. 
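
The model-comparison guidelines cited above (a CFI decrease of less than .01 and an RMSEA increase of less than .015) can be written as a simple decision rule. The following Python sketch is illustrative only: the function name and the fit values in the example are hypothetical, and, as emphasized above, the cutoffs are rough guidelines rather than a substitute for an integrated evaluation of fit.

```python
def supports_constrained_model(cfi_free, cfi_constrained,
                               rmsea_free, rmsea_constrained,
                               max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Return True if both rough guidelines favor the more constrained model.

    max_cfi_drop follows Cheung and Rensvold (2002) and Chen (2007);
    max_rmsea_rise follows Chen (2007).
    """
    cfi_drop = cfi_free - cfi_constrained
    rmsea_rise = rmsea_constrained - rmsea_free
    return cfi_drop < max_cfi_drop and rmsea_rise < max_rmsea_rise


# Hypothetical fit values for a less and a more constrained model.
print(supports_constrained_model(0.960, 0.956, 0.050, 0.054))
```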

Latent contextual effect models: Substantive-methodological 
synergy. Only recently has BFLPE research integrated the appli- 
cation of multilevel models (e.g., students nested within classes) with 
the use of latent variable models (with multiple indicators of latent 
constructs, the multiple MSC items) and multiple group analyses to 
compare results across countries (see Lüdtke, Marsh, Robitzsch, & 
Trautwein; Lüdtke et al., 2008; Marsh, Lüdtke, et al., 2009; Nagengast 
& Marsh, 2012). In the present application of this evolving 
statistical approach (see Figure 2), we used manifest aggregation to 
form the class-average measure of achievement such that 

\[
\begin{aligned}
\text{Self-concept} &= \beta_0 + \beta_1(\text{achievement}) + r \\
\beta_0 &= \gamma_{00} + \gamma_{01}(\text{class-average achievement}) + u_0
\end{aligned}
\tag{1}
\]

where $\beta_0$ is a random intercept and $\beta_1$ is the effect of individual 
student achievement on self-concept; $\gamma_{01}$ represents the variation in 
$\beta_0$ that is explained by class-average achievement; $r$ and $u_0$ are 
residual terms. For TIMSS data, intact classes were sampled so that 
the sampling ratio approached 1.0 and so sampling error was minimal. 
Indeed, the use of latent aggregation (and the use of within-class 
achievement variation to estimate sampling error) would overcorrect 
BFLPE estimates (see Marsh, Lüdtke, et al., 2009; although the size 
of the bias would be small because of the substantial sample sizes). 

Figure 2. Multilevel depiction of the big-fish-little-pond effect (BFLPE). MAch = math achievement; MSC = math self-concept. 
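
Manifest aggregation amounts to using the observed class mean as the class-level predictor. A minimal Python sketch, assuming hypothetical records of the form (class identifier, achievement score), is given below; the function names and data are illustrative and are not taken from the TIMSS database or from the Mplus setup.

```python
from collections import defaultdict


def class_average(records):
    """Observed (manifest) mean achievement for each class."""
    totals = defaultdict(lambda: [0.0, 0])
    for class_id, score in records:
        totals[class_id][0] += score
        totals[class_id][1] += 1
    return {cid: total / n for cid, (total, n) in totals.items()}


def add_context(records):
    """Attach the class mean and the within-class deviation to each student."""
    means = class_average(records)
    return [(cid, score, means[cid], score - means[cid])
            for cid, score in records]


# Hypothetical standardized achievement scores in two classes.
students = [("a", 1.0), ("a", 3.0), ("b", 5.0)]
for row in add_context(students):
    print(row)
```

Because intact classes are tested, the observed class mean is treated as the class-average value itself rather than as an error-prone sample estimate of it.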

The contextual effects models were estimated with the reflective 
aggregation procedure in Mplus (Muthén & Muthén, 2013) that uses 
implicit group-mean centering of all L1 variables. This implies that 
the partial regression weights associated with L1 variables reflect L1 
effects, while the partial regression weights associated with L2 vari- 
ables reflect L2 effects that are not controlled for L1 differences 
(Enders & Tofighi, 2007; Kreft, de Leeuw, & Aiken, 1995). Estimates 
of contextual effects, which represent the effect of L2 variables after 
controlling for L1 differences, can be obtained by subtracting the L1 
effect from the L2 effect (Enders & Tofighi, 2007; Kreft et al., 1995): 

\[
\beta_{\text{context}} = \beta_{\text{L2}} - \beta_{\text{L1}} \tag{2}
\]

where $\beta_{\text{L2}}$ is the L2 effect, $\beta_{\text{L1}}$ the L1 effect, and $\beta_{\text{context}}$ is the 
contextual effect (see Figure 2). The standard error for the contextual 
effect was obtained with the multivariate delta method (see 
Raykov & Marcoulides, 2004). 

In order to facilitate comparisons with previous research, effect 
sizes (ESs) for the BFLPE (the effect of class-average achievement 
on MSC after controlling for individual student achievement) were 
calculated according to the recommendations of Marsh, Lüdtke, et 
al. (2009; Nagengast & Marsh, 2012) by the following formula: 

\[
\text{BFLPE ES} = 2 \times \beta \times \sigma_{\text{pred}} / \sigma_{y} \tag{3}
\]

where $\beta$ is the unstandardized regression coefficient, $\sigma_{\text{pred}}$ is the 
standard deviation of the predictor variable (achievement), and $\sigma_{y}$ 
is the standard deviation of the outcome variable (self-concept), 
resulting in an ES metric that is common across countries. This ES 
is comparable to Cohen’s d (Cohen, 1988), reflecting differences 
based on classes 1 standard deviation above the mean and 1 
standard deviation below the mean. For MSC, students in all 26 
(13 countries × 2 cohorts) groups completed the same items, and 
so it was appropriate to standardize MSC responses in relation to 
a standard deviation that was common across all 26 groups. How- 
ever, for math achievement (MAch), the tests were completely 
different for the two age cohorts, so that we standardized the 
achievement scores to have M = 0 and SD = 1 across the 13 
countries within each of the two age cohorts. 
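
Equation 3 is straightforward to compute. The Python sketch below is a direct transcription of the formula; the function name and the example values are hypothetical.

```python
def bflpe_effect_size(beta, sd_predictor, sd_outcome):
    """Effect size from Equation 3: 2 * beta * SD(predictor) / SD(outcome)."""
    if sd_outcome <= 0:
        raise ValueError("outcome standard deviation must be positive")
    return 2 * beta * sd_predictor / sd_outcome


# Hypothetical unstandardized coefficient and standard deviations.
print(bflpe_effect_size(-0.2, 1.0, 1.0))
```

The factor of 2 reflects the comparison between classes 1 standard deviation above and 1 standard deviation below the mean, which is what makes the metric comparable to Cohen's d.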

In the decomposition of group (13 countries × 2 age cohorts) 
into variance components and more detailed factorial (analysis-of- 
variance-like) contrasts, we relied heavily on the flexibility of the 
“model constraint” function in Mplus. The resulting tests of sta- 
tistical significance based on these model constraints were based 
on the delta method (Muthén & Muthén, 2013). Thus, for example, 
we used these constraints to obtain analysis-of-variance-like estimates 
of the statistical significance and proportion of variation in 
the relations of MSC with student level (L1) achievement and 
class-average (L2) achievement that was explained by the 13 
countries (and three groups of countries: Western, Asian, Middle 
Eastern Islamic), two age cohorts (Grade 4 versus Grade 8), and 
Age Cohort × Country interactions. These were followed by more 
specific tests of a priori hypotheses. This evolving methodology— 
combining the flexibility typically associated with analyses of 
manifest variables with latent variable models—is apparently a 
new contribution. 
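
The logic of an analysis-of-variance-like decomposition can be illustrated outside Mplus. The Python sketch below partitions a balanced Country × Cohort table of estimates into sums of squares for country, cohort, and their interaction. It is an analogy only: the function name and the 2 × 2 example are hypothetical, it does not reproduce the Mplus model constraints, and it yields no standard errors or variance components.

```python
def two_way_decomposition(cells):
    """Sums of squares for a balanced table cells[country][cohort] -> estimate."""
    countries = list(cells)
    cohorts = list(next(iter(cells.values())))
    n_cells = len(countries) * len(cohorts)
    grand = sum(cells[c][k] for c in countries for k in cohorts) / n_cells
    country_mean = {c: sum(cells[c][k] for k in cohorts) / len(cohorts)
                    for c in countries}
    cohort_mean = {k: sum(cells[c][k] for c in countries) / len(countries)
                   for k in cohorts}
    # Each main-effect deviation is counted once per cell of the table.
    ss_country = sum((country_mean[c] - grand) ** 2
                     for c in countries for _ in cohorts)
    ss_cohort = sum((cohort_mean[k] - grand) ** 2
                    for _ in countries for k in cohorts)
    ss_interaction = sum(
        (cells[c][k] - country_mean[c] - cohort_mean[k] + grand) ** 2
        for c in countries for k in cohorts)
    return {"country": ss_country, "cohort": ss_cohort,
            "interaction": ss_interaction}


# Hypothetical estimates for two countries and two cohorts.
example = {"A": {4: 0.4, 8: 0.6}, "B": {4: 0.5, 8: 0.9}}
print(two_way_decomposition(example))
```

Under this definition, cohort means of .492 and .692 around a grand mean of .592 across 26 cells give 26 × (.100)² = .260, which agrees with the cohort sum of squares reported for individual achievement in Table 2. This agreement does not establish that the tabled values were computed in exactly this way.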


Preliminary Results 


In preliminary analyses we estimated the average reliability of 
the MSC score and evaluated the a priori factor structure for these 
responses based on Marsh et al. (2013; see supplemental materials 
for more detail). Due in part to the brevity of the four-item MSC 
scale, reliability estimates (see Table 1) sometimes reached a 
desirable standard of .80, but in other cases fell below acceptable 
values of .70 or even .60. Reliability estimates were systematically 
higher for the older age cohort (mean α = .781) than the younger 
cohort (mean α = .681), and substantially lower in the Middle 
Eastern Islamic countries than in the Western or Asian countries. 
These systematic differences in reliability make comparisons based 
on manifest scale or composite scores problematic, and support 
the need to consider latent variable models that control for 
unreliability. 
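
A short Python sketch shows how a reliability estimate of this kind can be computed for a four-item scale. It assumes that the reported estimates are coefficient alpha and that negatively worded items have already been reverse-scored; both are assumptions rather than details stated here, and the function name and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for rows of item responses (one row per student)."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five students to four Likert items.
responses = [
    [4, 4, 3, 4],
    [2, 1, 2, 2],
    [3, 3, 3, 2],
    [1, 2, 1, 1],
    [4, 3, 4, 4],
]
print(round(cronbach_alpha(responses), 3))
```

With only four items, alpha is constrained by scale length, which is one reason the latent variable models used here are preferable to comparisons of manifest scale scores.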

Our a priori factor model (following from Marsh et al., 2013; 
see supplemental materials) is a simple model in which the four 
self-concept items are associated with one latent self-concept fac- 
tor, MAch is a single-item variable (represented by TIMSS’s five 
sets of plausible values that control for unreliability), and there is 
a negative-item method effect represented by a correlated unique- 
ness between the two negatively worded self-concept items. This 
model was supported based on a series of single-level multigroup 
(using the Mplus complex design to control for clustering of 
students within classes and schools) tests of invariance over 26 (2 
cohorts × 13 countries) groups. Next, we tested multilevel-multigroup 
CFA models demonstrating the invariance of factor 
loadings within and across student (L1) and class (L2) levels as 
well as the 26 groups. Subsequent results supported the highly 
constrained model, in which all factor loadings were constrained to 
be the same across all 26 groups at both the student and class levels 
(CFI = .956, TLI = [illegible in source], RMSEA = .054; see supplemental 
materials for the Mplus syntax used to test this model), which was the 
basis of subsequent results. 


Results 


Support for the BFLPE requires that the effect of individual (L1) 
achievement on MSC is positive, while the effect of class-average 
(L2) achievement is negative (see Figure 1). The standardized path 
coefficients between individual student achievement and MSC are 
significantly positive in all 26 (13 countries × 2 cohorts) groups 
(M = .592, SE = .005; see Table 2). In contrast, the BFLPEs (ESs 
for the negative effect of class-average achievement on MSC) are 
significantly negative in all 26 groups (M = −.377, SE = .012; see 
Table 2). These results provide strong support for the generalizability 
of the BFLPE across countries and across age cohorts. We now address 
substantively important developmental and cross-cultural issues, 
evaluating how these effects vary as a function of age cohort, 
country, and their interaction. 


Relations Between Student Level (L1) Achievement 
and Self-Concept 


Averaged across all countries and age cohorts, achievement and 
MSC are positively correlated (M = .592, SE = .005; see Table 2). 
Next, we decomposed variance estimates into contrast tests of 
differences associated with the 13 countries, the two age cohorts, 
and their interactions; and estimated variance components for each 
of these differences (sums of squares and variance components in 
Table 2). Estimates for all 26 groups, the mean estimate for each 
country, the mean estimate for each cohort, and the means for each 
of the three country groupings are all significant and positive. The 
variance components associated with each effect—along with an 
inspection of the values for each of the 26 (2 cohorts × 13 
countries) groups—provide an indication of the sizes of the effects 
and how well they generalize over age cohorts and countries. 

Cohort effects. The size of the relation between MAch and 
MSC, averaged across all countries, is substantially larger for the 
older cohort (M = .692, SE = .008) than for the younger cohort 
(M = .492, SE = .006). However, interpretations of cohort differences 
are complicated by Cohort × Country interactions, suggesting 
that cohort differences vary for different countries. The 
positive relations are substantially larger for secondary than pri- 
mary students (i.e., difference scores in Table 2 are significantly 
positive) in all Western and in all Islamic countries, but not in all 
the Asian countries. Although the mean estimate across the Asian 
countries is significantly (p < .05) higher for the older cohort, the 
difference is small and inconsistent across the Asian countries. 
Among the Asian countries, only in Singapore is the positive relation 
significantly larger for secondary than primary students; in Taiwan the 
positive relation is significantly larger for primary than for secondary 
students (cohort differences are not significant in Hong Kong and Japan). 
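
The significance statements above follow the rule given in the note to Table 2: an estimate is significant at p < .05 when its ratio to its standard error exceeds 1.96 in absolute value. The Python sketch below applies this rule to the cohort differences in the L1 relation for the four Asian countries, using the values tabled in Table 2; the function name is hypothetical.

```python
def is_significant(estimate, std_error, critical=1.96):
    """Two-tailed test at p < .05 based on the ratio estimate / SE."""
    return abs(estimate / std_error) > critical


# Cohort differences (and SEs) in the L1 relation, as tabled in Table 2.
asian_differences = {
    "Hong Kong": (-0.027, 0.054),
    "Japan": (0.004, 0.027),
    "Taiwan": (-0.107, 0.028),
    "Singapore": (0.239, 0.036),
}
for country, (diff, se) in asian_differences.items():
    print(country, is_significant(diff, se))
```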

Country differences. Although the relation between self- 
concept and achievement is significantly positive for both cohorts 
in each of the 13 countries (see Table 2), estimates are more 
positive in the Western (M = .627, SE = .008) and Asian (M = 


Table 2 
Effects of Individual Student Achievement and Class-Average 
Achievement on Math Self-Concept Broken Down by 13 
Countries × 2 Age Cohorts 

                                Individual         Class-average achievement
Country             Age cohort  achievement        (BFLPE effect size)

Western countries
Australia           4           .583 (.021)        −.358 (.042)
                    8           .914 (.032)        −.627 (.063)
                    Diff        .331 (.038)        −.269 (.075)
                    Total       .749 (.019)        −.493 (.038)
England             4           .460 (.018)        −.294 (.048)
                    8           .629 (.035)        −.359 (.051)
                    Diff        .169 (.039)        −.065 (.069)^b
                    Total       .544 (.019)        [illegible]
Italy               4           .481 (.019)        −.482 (.041)
                    8           .832 (.020)        −.907 (.068)
                    Diff        .351 (.028)        −.425 (.079)
                    Total       .656 (.014)        −.694 (.039)
Norway              4           .369 (.019)        −.134 (.054)
                    8           .937 (.022)        −.527 (.086)
                    Diff        .568 (.030)        −.393 (.104)
                    Total       .653 (.014)        −.331 (.050)
Scotland            4           .364 (.019)        −.418 (.072)
                    8           .563 (.032)        −.282 (.058)
                    Diff        .199 (.036)         .137 (.093)^b
                    Total       .463 (.019)        −.350 (.046)
United States       4           .631 (.018)        −.352 (.038)
                    8           .766 (.025)        −.502 (.050)
                    Diff        .135 (.033)        −.150 (.065)
                    Total       .699 (.014)        −.427 (.030)
Mean Western        4           .481 (.009)        −.340 (.021)
                    8           .774 (.013)        −.534 (.026)
                    Diff        .292 (.016)        −.194 (.034)
                    Total       .627 (.008)        −.437 (.016)

Asian countries
Hong Kong           4           .609 (.032)        −.441 (.062)
                    8           .636 (.034)        −.549 (.051)
                    Diff       −.027 (.054)^b      −.107 (.085)^b
                    Total       .623 (.020)        −.495 (.038)
Japan               4           .578 (.019)        −.247 (.078)
                    8           .574 (.019)        −.482 (.068)
                    Diff        .004 (.027)^b      −.236 (.106)
                    Total       .576 (.013)        −.364 (.051)
Taiwan              4           .688 (.026)        −.475 (.077)
                    8           .581 (.019)        −.180 (.056)
                    Diff       −.107 (.028)         .295 (.095)
                    Total       .635 (.018)        −.327 (.047)
Singapore           4           .589 (.019)        −.211 (.040)
                    8           .828 (.033)        −.585 (.062)
                    Diff        .239 (.036)        −.374 (.074)
                    Total       .708 (.020)        −.398 (.037)
Mean Asia           4           .616 (.011)        −.343 (.033)
                    8           .655 (.016)        −.449 (.031)
                    Diff        .039 (.018)        −.106 (.047)
                    Total       .635 (.010)        −.396 (.022)

Middle Eastern Islamic countries
Iran                4           .510 (.022)        −.175 (.044)
                    8           .580 (.023)        −.362 (.047)
                    Diff        .069 (.031)        −.186 (.064)
                    Total       .545 (.016)        −.268 (.032)
Kuwait              4           .236 (.017)        −.089 (.038)
                    8           .457 (.023)        −.342 (.065)
                    Diff        .221 (.027)        −.252 (.076)
                    Total       .347 (.015)        −.216 (.037)
Tunisia             4           .300 (.014)        −.117 (.048)
                    8           .703 (.026)        −.314 (.102)
                    Diff        .403 (.030)        [illegible]
                    Total       .502 (.015)        −.216 (.057)
Mean Islam          4           .349 (.010)        −.127 (.025)
                    8           .580 (.014)        −.339 (.043)
                    Diff        .231 (.016)        −.212 (.050)
                    Total       .464 (.009)        −.233 (.025)

Mean across all 26 (13 countries × 2 cohorts) groups
Total               4           .492 (.006)        −.292 (.015)
                    8           .692 (.008)        −.463 (.018)
                    Diff        .200 (.010)        [illegible] (.023)
                    Total       .592 (.005)        −.377 (.012)

Sums of squares (SS) and variance components (VC)^a
SS cohort                       .260 (.026)         .210 (.053)
VC                              .067                .013
SS country                      .300 (.033)         .419 (.077)
VC                              .077                .026
SS interaction                  .205 (.021)         .226 (.060)
VC                              .053                .014

Note. For each country, the three country groupings, and the total across 
all 26 (13 countries × 2 cohorts) groups, results are presented for each age 
cohort (fourth grade and eighth grade), the difference (diff) between age 
cohorts, and the total across age cohorts. For each estimate there is a 
standard error (in parentheses) that can be used to assess statistical 
significance (i.e., estimates divided by their standard error that are greater than 
1.96 are statistically significant at p < .05). For individual student 
achievement, the estimates are the standardized path coefficients, while for the 
big-fish-little-pond effect (BFLPE) the estimates are effect sizes. 
^a Effects across the 26 groups were decomposed to assess the main effects 
of differences due to the 13 countries, the two age cohorts, and their 
interaction (sums of squares and variance components). 
^b Estimates that are not statistically significant in the predicted direction; 
all other estimates are statistically significant at p < .05. 


.635, SE = .010) than in the Islamic (M = .464, SE = .009) 
countries. However, results for individual countries are not entirely 
consistent, even within each of the three country classifications, 
and these differences interact significantly with cohort. For the 
younger cohort, the relations are substantially more positive in 
Asian countries (M = .616, SE = .011) than in the Western (M = 
.481, SE = .009) and particularly the Islamic (M = .349, SE = 
.010) countries. However, for the older cohort, relations are 
substantially more positive in the Western countries (M = .774, SE = 
.013) than in Asian (M = .655, SE = .016) and particularly in 
Islamic (M = .580, SE = .014) countries. Averaged across cohorts, 
the estimates are most positive in Australia, Singapore, and the 
United States, but are also higher in Italy, Norway, Taiwan, Hong 
Kong, and Japan than in any of the three Islamic countries. 


The BFLPE: The Negative Effects of Class-Average 
Ability on MSC 


Averaged across all countries and age cohorts, class-average 
achievement has a negative effect on MSC (mean BFLPE 
ES = −.377, SE = .012; see Table 2). Although the BFLPE is 
significantly negative for each of the 26 (2 cohorts × 13 countries) 




groups, its size does vary significantly with cohort, country, and 
their interaction, as demonstrated by variance components along 
with an inspection of the values for each of the 26 groups. 

Cohort effects. Because there have been no previous cross- 
cultural studies of the BFLPE with primary school students, the 
most important result of our study is that in all 13 countries the 
BFLPE is statistically significant and negative for the younger 
cohort, as well as the older cohort (see Table 2). An important 
contribution of our study is the finding, consistent with predictions, 
that the BFLPE is significantly larger in the eighth-grade 
cohort (mean BFLPE ES = −.463, SE = .018) than in the 
fourth-grade cohort (mean BFLPE ES = −.292, SE = .015). 
Furthermore, the mean BFLPE ESs are significantly more negative 
for the older cohort in each of the three country groups, and these 
cohort differences do not differ significantly from each other 
(West: −.194, SE = .034; Asian: −.106, SE = .047; Islamic: 
−.212, SE = .050). Again, however, interpretations of cohort 
differences are complicated by Cohort × Country interactions. For 
four countries (England, Scotland, Hong Kong, and Tunisia) the 
BFLPE did not differ significantly as a function of cohort, while in 
one country (Taiwan) the BFLPE was significantly more negative 
for the younger cohort. 

Country differences. The BFLPE is significantly negative for 
both cohorts in all 13 countries, but there are differences between 
countries (see Table 2). Across the cohorts, the BFLPE is more 
negative in the Western (mean BFLPE ES = −.437, SE = .016) 
and Asian (mean BFLPE ES = −.396, SE = .022) countries than 
in Islamic (mean BFLPE ES = −.233, SE = .025) countries. As 
already noted, the BFLPE is noticeably smaller in the younger 
cohort of Islamic countries (−.127, SE = .025). The BFLPE is 
particularly large in Italy (mean BFLPE ES = −.694, SE = .039), 
but is also very substantial in Hong Kong (mean BFLPE 
ES = −.495, SE = .038), Australia (mean BFLPE ES = −.493, 
SE = .038), and, to a lesser extent, the United States (mean BFLPE 
ES = −.427, SE = .030) and Singapore (mean BFLPE 
ES = −.398, SE = .037). The BFLPE was smallest in Tunisia 
(mean BFLPE ES = −.216, SE = .057) and Kuwait (mean BFLPE 
ES = −.216, SE = .037), particularly for the younger cohort. 


Discussion 


Substantively and theoretically, the most important result of the 
present investigation is that the BFLPE—the negative effect of 
class-average achievement on MSC—is statistically significant 
and generalizes across both age cohorts in all 13 countries, pro- 
viding good support for its developmental and cross-national gen- 
eralizability. This is important because this is the only large-scale 
cross-cultural study to compare the BFLPE across matched sam- 
ples of primary and secondary students from a broad array of 
different countries. 

Our study is also the first large-scale cross-cultural study of the 
BFLPE not based on PISA. Importantly, the consistency of the 
BFLPEs for both cohorts for the TIMSS data in our study is even 
stronger than in previous cross-cultural studies based on PISA 
data. Thus, the average BFLPE ES across 123 samples based on 
PISA data (59 countries sampled in one or more data collections in 
PISA 2000, PISA 2003, and PISA 2006) is −.223, while the 
average BFLPE ES across 26 samples (13 countries × 2 age 
cohorts) in the present study is −.377. Furthermore, this general 


trend is reasonably consistent across overlapping countries that 
participated in both PISA and TIMSS. This might seem surprising, 
in that PISA data is based on somewhat older students (15-year-olds) 
than even the oldest TIMSS cohort, and our results, consistent 
with a priori developmental predictions, show that the BFLPE is 
more negative for older students (−.292 for Year 4, −.463 for 
Year 8). However, these findings are consistent with our a priori 
predictions based on the local dominance effect when comparing 
results based on school-average achievement (PISA) and class- 
average achievement (TIMSS). Nevertheless, there are a number 
of critical differences between TIMSS and PISA sampling designs 
that might explain, in part, these differences but also dictate 
caution in interpretation of the results: 

• The nature of the standardized achievement tests is more 
closely related to the academic curriculum in TIMSS than in PISA, 
so that the frame of reference based on class-average TIMSS is 
more closely associated with the achievement results (e.g., school 
grades, teacher feedback) that are actually experienced by these 
students. 

• The sampling unit for PISA is the whole school, while that of 
TIMSS is the individual class. Although both are relevant, there is 
some theoretical and empirical research (see earlier discussion of 
the local dominance effect; e.g., Liem et al., 2013; Zell & Alicke, 
2009) suggesting that the more proximal frame of reference asso- 
ciated with the class is stronger—more locally dominant—than 
that associated with the whole school, particularly if there is 
streaming or tracking within schools, so that there are systematic 
differences in achievement levels for different tracks within 
schools. 

• TIMSS samples intact classes that almost always represent a 
single-year group, whereas PISA samples 15-year-olds and typi- 
cally includes two, three, or even four year groups within a given 
school. Thus, BFLPE interpretations for PISA are complicated in 
that the school-average ability estimate is an aggregation of test 
scores across multiple-year groups that do not correspond to any 
one of the year groups actually considered. Also, school-average 
ability estimates in PISA studies are typically based on a relatively 
small proportion of the 15-year-olds in the school, so that at least 
moderate amounts of sampling error in the school-average esti- 
mates are likely; the school-average estimate is a sample mean 
estimate of the true school-average value if all 15-year-olds in the 
school are tested. Although recent advances in contextual models 
used to assess the BFLPE are able to control for sampling error 
(Marsh et al., 2009; Nagengast & Marsh, 2012), this is typically at 
the expense of statistical power. In contrast, class-average esti- 
mates of achievement based on TIMSS scores are based on intact 
classes, so that there is little or no sampling error. 
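
The contrast in sampling error between the two designs can be made concrete with the standard error of a group mean under simple random sampling without replacement, which includes the finite population correction (1 − n/N). The Python sketch below is illustrative only: the function name and the group sizes are hypothetical, and actual PISA and TIMSS sampling is more complex than simple random sampling within a school or class.

```python
import math


def se_of_group_mean(sd_within, n_sampled, n_total):
    """Standard error of a group mean with the finite population correction."""
    if not 0 < n_sampled <= n_total:
        raise ValueError("need 0 < n_sampled <= n_total")
    fpc = 1 - n_sampled / n_total
    return sd_within / math.sqrt(n_sampled) * math.sqrt(fpc)


# Intact class: all 25 students tested, so the sampling ratio is 1.0.
print(se_of_group_mean(1.0, 25, 25))
# Hypothetical school: 25 of 100 eligible 15-year-olds tested.
print(round(se_of_group_mean(1.0, 25, 100), 3))
```

When every student in the group is tested, the correction term is zero and the observed mean carries no sampling error, which is the rationale for manifest aggregation of class-average achievement with intact classes.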

In summary, our results demonstrate that the size of the BFLPE 
is systematically larger for high school students than primary 
school students, although the BFLPE is clearly evident in both age 
cohorts. Our findings also suggest, consistent with the local dom- 
inance effect, that the BFLPE is substantially more negative for 
class-average achievement, as in TIMSS data, than for school- 
average achievement, as in PISA data. However, neither PISA nor 
TIMSS is ideal for testing this difference in that neither of these 
international comparisons allows researchers to distinguish prop- 
erly between the effects of class- and school-average achievement 
in a more appropriate three-level (L1 = students, L2 = classes, 
L3 = schools) model. Furthermore, although it is likely that future 
research will be able to address this issue with data from individual 
countries, it seems unlikely that future research will be able to 
test the cross-cultural generalizability of these results with data as 
comprehensive as the PISA and TIMSS data. However, a compre- 
hensive meta-analysis of BFLPE studies (that included PISA and 
TIMSS studies as well as the large number of studies done in 
individual countries) would be a useful addition to the literature. 

Our study is also apparently the first to specifically compare 
BFLPE results in a sample of Middle Eastern Islamic countries 
with those from Asian and Western countries, which have been the 
basis of most BFLPE research. Indeed, only one of these Islamic 
countries in our study (Tunisia) had participated in PISA. Al- 
though the BFLPE was statistically significant for both age cohorts 
in all the Islamic countries, it was significantly smaller—particu- 
larly in the younger age cohort. In line with earlier research in 
Arab and Islamic countries (Marsh et al., 2013; see also Abu-Hilal 
& Bahri, 2000), we also found that relations between L1 achieve- 
ment and MSC were significantly smaller in the Islamic coun- 
tries—again, particularly for the younger age cohort. These au- 
thors previously speculated that students from these countries do 
not receive as much evaluative feedback about their achievement 
as Western and Asian students, and are not socialized in such a 
way as to critically evaluate their academic skills in relation to 
classmates. Indeed, consistent with speculations by Abu-Hilal and 
Bahri (2000) that self-concept formation of ASC and its relation 
with achievement in Middle Eastern Islamic middle school stu- 
dents was similar to that found in younger students from Western 
countries, support for the BFLPE for eighth-grade Middle Eastern 
Islamic students is similar to that found for the fourth-grade cohort 
in the Western and Asian countries (for further discussion, see 
Abu-Hilal, 2001; Abu-Hilal & Aal-Hussain, 1997; Abu-Hilal & 
Bahri, 2000). 


Strengths, Limitations, and Directions for 
Further Research 


Important strengths of our study include the use of large, na- 
tionally representative samples of primary and secondary school 
students from culturally diverse countries who were tested with 
standardized materials under standardized conditions; the integra- 
tion of CFA, SEM, and multiple-group and multilevel modeling 
into a single analytic framework; and decomposition of differences 
in the BFLPEs associated with age cohort, country, and their 
interaction. In these respects, our study is a strong exemplar of the 
methodological-substantive synergies that apply evolving statisti- 
cal methodology to substantively and theoretically important is- 
sues that have policy and practice implications. Nevertheless, as is 
always the case, there are important limitations that may provide 
the basis of further research. 

Reliance on cross-sectional data for only two age cohorts. 
Reliance on only two cross-sectional age cohorts requires addi- 
tional caveats in the interpretation of the results from a develop- 
mental perspective. For example, the apparent differences as a 
function of age might also be a function of birth cohort effects, and 
we were not able to evaluate how consistent the effects of age were 
for different individuals. Nevertheless, there are also some limita- 
tions with longitudinal data (e.g., generalizability of the results to 
other age cohorts; complications in sampling designs, missing 
data, representativeness of data within each country, and comparability 
across countries, particularly in relation to tracking students 
from primary to secondary schools). Ultimately, the “best” 
developmental description of the BFLPE must incorporate 
findings from both cross-sectional and longitudinal studies, more 
fully evaluate developmental aspects of the BFLPE, and use a 
wider range of ages based on a combination of multicohort and 
longitudinal data. 

Assumptions of causality and underlying processes. Support 
for the BFLPE—and contextual models more generally—is largely 
based on cross-sectional, correlational studies, so that causal in- 
terpretations should be offered tentatively and interpreted cau- 
tiously. In particular, the “third variable” problem is always a 
threat to contextual studies that do not involve random assignment, 
but Marsh, Hau, and Craven (2004; Marsh, Seaton, et al., 2008) 
argue that this is an unlikely counterexplanation of BFLPE results, 
in that most potential third variables (resources, per student ex- 
penditures, socioeconomic status, teacher qualifications, etc.) are 
positively related to class- or school-average achievement, so that 
controlling for them would typically increase the size of the 
BFLPE. Fortunately, there is now a growing body of BFLPE 
research using various combinations of longitudinal, quasi- 
experimental, and true experimental designs that all support the 
BFLPE (see Marsh, 2007; Marsh, Seaton, et al., 2008; Nagengast 
& Marsh, 2012). Quasi-experimental, longitudinal studies (e.g., 
Marsh, Kong, & Hau, 2000) show that students’ ASC declines 
when students shift from mixed-ability schools to academically 
selective schools over time (based on pre- and posttest compari- 
sons) and compared to students matched on academic ability who 
continue to attend mixed-ability schools. There is support for the 
BFLPE in studies where achievement is based on tests adminis- 
tered before students began high school (e.g., Marsh et al., 2000). 
Extended longitudinal studies (Marsh et al., 2000; Marsh, Trautwein, 
Lüdtke, Baumert, & Köller, 2007) show that the BFLPE 
grows more negative the longer students attend a selective school 
and is maintained even 2 and 4 years after graduation from high 
school. Also, there is good support for the theoretical underpin- 
nings of the BFLPE, as it is largely limited to academic compo- 
nents of self-concept and nearly unrelated to nonacademic com- 
ponents of self-concept and self-esteem (Marsh, 1987; Marsh & 
Parker, 1984). However, further longitudinal and intervention 
studies would be useful to bolster the case for mediation of the 
effects of L1 and L2 achievement on subsequent achievement and 
educational attainment by ASC. 

Also implicit in the BFLPE is the assumption that the direction 
of causal ordering is from class- or school-average ability to ASC. 
Although apparently reasonable, this implicit causal ordering cannot
easily be tested with cross-sectional data. However, the BFLPE is
also supported by longitudinal studies, in which the temporal ordering
is more clear-cut, and by true experimental studies, in which class- or
school-average achievement is experimentally manipulated.


Policy Implications


Our study greatly extends the generality of the negative effects 
of attending classes and schools where the average ability level of 
classmates is high, demonstrating the cross-cultural generalizabil- 
ity to primary school children as well as secondary school adoles- 
cents. These results also greatly expand the scope of support for 
the universality of the BFLPE as a panhuman theory that has
previously been based primarily on PISA data (Seaton et al., 
2009). Indeed, our results suggest that the negative effects of 
school-average achievement based on PISA might substantially 
underestimate the results for more proximally relevant measures of 
class-average achievement based on TIMSS. Although theoreti- 
cally important, these findings are worrisome, as ASC is well 
known to be an important predictor of academic choice and long- 
term engagement (Guay, Marsh, & Boivin, 2003; Marsh, 1991; 
Marsh & Craven, 2006; Marsh & O’Mara, 2010; Marsh & Yeung,
1997). Particularly when so many parents, teachers, and policy
analysts uncritically assume that academically selective schools must
automatically benefit the students who attend them, it is important 
to provide an alternative perspective based on strong theory and 
rigorous research. More generally, BFLPE research provides an 
alternative, contradictory perspective to educational policy on the 
placement of students in special education settings, which is a 
hotly debated topic in many countries throughout the world. 

Social Consequences of Academic Teaming in Middle School:
The Influence of Shared Course Taking on Peer Victimization 


Leslie Echols 

This study examined the influence of academic teaming (i.e., sharing academic classes with the same
classmates) on the relationship between social preference and peer victimization among 6th-grade 
students in middle school. Approximately 1,000 participants were drawn from 5 middle schools that 
varied in their practice of academic teaming. A novel methodology for measuring academic teaming at 
the individual level was employed, in which students received their own teaming score based on the 
unique set of classmates with whom they shared academic courses in their class schedule. On the basis 
of both peer- and self-reports of victimization, the results of 2 path models indicated that students with 
low social preference in highly teamed classroom environments were more victimized than low- 
preference students who experienced less teaming throughout the school day. This effect was exaggerated 
in higher performing classrooms. Implications for the practice of academic teaming were discussed. 


Keywords: academic teaming, ability grouping, middle school, social preference, peer victimization 


The early middle school years are rife with social and academic 
challenges as children make the move from elementary to second- 
ary education. Many children experience decreases in school liking 
and engagement as they navigate the new middle school environ- 
ment (Burchinal, Roberts, Zeisel, & Rowley, 2008)—an environ- 
ment in which peer aggression is at its peak (Eslea et al., 2003; 
Seals & Young, 2003). Victims of such aggression are at heightened
risk of academic difficulties (Erath, Flanagan, & Bierman,
2008; Juvonen, Wang, & Espinoza, 2011; Nakamoto & Schwartz,
2010; Schwartz, Gorman, Dodge, Pettit, & Bates, 2008) and may
also suffer from a wide range of internalizing (depression, loneliness,
low self-esteem) and externalizing (aggression, delinquency,
poor self-regulation) symptoms (Hanish & Guerra, 2002; Hodges,
Malone, & Perry, 1997; Schwartz et al., 2008).

In middle school, when fitting in and being accepted by peers 
are top social priorities (see Fournier, 2009), having low social 
preference among peers (i.e., being more disliked than liked) may 
increase the risk of peer victimization. Although low social pref- 
erence may not always be associated with peer victimization, 
having low social preference among peers and being the victim of 
peer aggression consistently leads to the most negative adjustment 
outcomes (Hodges et al., 1997; Sandstrom & Cillessen, 2003). It is 
important, therefore, to understand the conditions under which low 
social preference among peers does contribute to peer victimiza- 
tion. 

Previous research has documented certain individual character- 
istics (e.g., low impulse control) that make some low preference 
children vulnerable to peer victimization (cf. Sandstrom & Cil- 
lessen, 2003), but no studies to date have considered the role of 
school context, or how instruction is organized, in the relation
between low social preference and peer victimization. This is 
surprising given the recognition among scholars that school con- 
text may explain vulnerability to peer victimization when individ- 
ual characteristics fail to do so (see Brown, 1996; Merten, 1996). 
For example, Kochenderfer-Ladd and Skinner (2002) posited that 
repeated exposure of aggressors to their targets due to placement in 
the same classroom and/or school may contribute to the stability of 
victimization across the school years. Brown (1996) likewise sug- 
gested that a restricted range of peer encounters at school (as 
opposed to mixing with a wider range of grade mates) may 
contribute to reputation formation and fewer opportunities for 
change. The purpose of this study, therefore, was to examine 
whether the way in which students are grouped together in middle
school, defined as their likelihood of taking classes with the same
classmates, contributes to victimization for children with low
social preference among their peers. 


Interdisciplinary Teaming in Middle School 


In middle school, the extent to which children share their classes 
with the same classmates, and therefore the likelihood of repeated 
contact with aggressors, is often influenced by whether interdisci- 
plinary teaming is practiced in their school. Interdisciplinary team- 
ing consists of a core set of teachers responsible for teaching the 
same group of students—typically a subset of same-grade students 
in the school population—with the intended benefits of greater 
collaboration among teachers and greater community among 
teachers and students, particularly during the transition to middle 
school (Thompson & Homestead, 2004). These benefits are well 
documented in the literature. For example, past research demon- 
strates that students in interdisciplinary teams have higher scores 
on standardized achievement tests, are more academically en- 
gaged, and have greater feelings of school belonging (Boyer & 
Bishop, 2004; Flowers & Mertens, 2003; Flowers, Mertens, & 
Mulhall, 1999; Lee & Smith, 1993; Wallace, 2007). 

Although the practice of interdisciplinary teaming is widespread,
estimated to be in use in nearly 80% of all U.S. middle
schools (McEwin, Dickinson, & Jenkins, 2003), little is known
about the role of interdisciplinary teaming in children’s relation- 
ships with and treatment by their peers. Although students change 
classrooms and teachers each period when they are teamed, this 
practice may restrict their exposure to the general student body at 
their school because their classmates are always composed of 
members of their team. Social preference may therefore be deter- 
mined largely by the reputations formed within their team. Popular 
or well-liked students may enjoy taking classes with many of the same 
classmates who regard them as high status members of the peer 
group, but peer-rejected or disliked students may suffer the con- 
sequences of being repeatedly exposed to classmates with whom 
they have negative social relationships. In other words, interdisci- 
plinary teaming may be socially beneficial for high-preference 
children but detrimental for children with low social preference, 
who must endure a poor reputation throughout the majority of the 
school day. 


Academic Teaming: A Special Case of 
Interdisciplinary Teaming 


In middle schools in which interdisciplinary teaming is practiced,
exposure to the same classmates throughout the school day
may be influenced by the type and amount of interdisciplinary 
teaming that occurs. For example, in schools with a small number 
of teams relative to the size of the grade-level population, children 
may not have the same set of classmates each period, even though 
their classmates always come from the same pool (team) of stu- 
dents. On the other hand, in schools where there is a large number 
of teams relative to the size of the grade-level population (i.e., each 
team comprises only one classroom of students), interdisciplinary 
teaming would result in the same classmates traveling together 
from course to course for all of their academic classes—a special 
case of interdisciplinary teaming referred to here as academic 
teaming. 

Unfortunately, the teaming literature does not differentiate be- 
tween interdisciplinary teaming in general and the more specific 
case of academic teaming. In fact, one major limitation of previous 
research is that interdisciplinary teaming has been measured as a 
school-level dichotomous indicator (practiced/not practiced), mak- 
ing it virtually impossible to investigate individual outcomes as- 
sociated with the extent of teaming that occurs. So although 
teaming may appear to have a positive effect on middle school 
adjustment for children overall, it is unclear whether there might 
be negative outcomes associated with the practice of teaming for 
some children, particularly those with low social preference among 
their peers. 


Academic Teaming and Peer Victimization 


Empirical research on social reputations indicates that peer 
status is less stable across changing peer settings than in settings in 
which peers remain the same (Bukowski & Newcomb, 1984; Coie 
& Kupersmidt, 1983). When considering the role of academic 
teaming in peer victimization, this research suggests that the rela- 
tion between low social preference and victimization might be 
stronger when academic teaming is practiced and weaker when it
is not. To illustrate, for low-preference children in middle school,
changing classes and classmates each period of the school day may 
help reduce their visibility among peers and their likelihood of 
being victimized because each class would be composed of a 
different set of classmates and social norms. In other words, 
children with low social preference who share the fewest number 
of classes with the same peers may have the most opportunities to 
avoid victimization. On the contrary, if the middle school structure 
is such that children take classes primarily with the same set of 
classmates, even when they change classrooms, social status hier- 
archies may be more salient to the peer group, increasing the 
probability that children with low social preference would also 
experience peer victimization. 

Ability grouping. In many schools, interdisciplinary teams 
are composed of students with similar academic profiles, and 
students share all their classes with classmates performing at the 
same academic level (Ansalone, 2001, 2006; Dauber, Alexander, 
& Entwisle, 1996; Eccles, Midgley, & Wigfield, 1993; Oakes, 
1981). Thus, in practice, teaming may be synonymous with ability 
grouping or academic tracking. In order to isolate the true effect of 
teaming, independent of ability grouping, it is therefore necessary 
to also consider the role of classroom academic performance (e.g., 
achievement level among classmates) in the relation between 
academic teaming, social preference, and peer victimization. The 
existing literature can help us understand how this unique set of 
individual and classroom characteristics may interact. For exam- 
ple, it is well documented that children who are performing well 
academically are more likely to be popular among (i.e., liked or 
accepted by) their peers (DeRosier, Kupersmidt, & Patterson, 
1994; Guay, Boivin, & Hodges, 1999; Meijs, Cillessen, Scholte, 
Segers, & Spijkerman, 2010). As such, higher performing class- 
rooms may be composed of a greater concentration of children 
with high social preference. It stands to reason that children with low social
preference in such a context would be more likely to “stick out,” 
thus increasing their risk for victimization; this risk may be com- 
pounded if, due to academic teaming, these low-preference chil- 
dren remain with the same high-preference classmates throughout 
their academic schedule. 


The Present Study 


Although the academic benefits of interdisciplinary teaming in 
middle school are well understood, the social consequences asso- 
ciated with this common educational practice have been relatively 
unexplored. Certain types of interdisciplinary teaming, such as 
academic teaming, might increase the visibility of children’s rep- 
utations in the peer group. For high-preference children, this 
visibility could result in social benefits, but for low-preference 
children, academic teaming could make them more vulnerable to 
peer maltreatment, such as being victimized. The primary objec- 
tive of this study, therefore, was to investigate the influence of 
academic teaming on the relation between social preference and 
the likelihood of victimization among peers. Because academic 
teaming is often practiced in conjunction with ability grouping, the 
next objective of this study was to examine whether classroom 
academic performance plays a role in the influence of academic 
teaming on this relation. 

To achieve these objectives, some other limitations of the 
interdisciplinary-teaming literature were addressed. Rather than 
rely on a school-level dichotomous indicator of teaming (practiced/not
practiced), an individualized and continuous measure of
teaming was developed for this study in order to account for the 
extent to which students share their classes with the same class- 
mates across the academic subjects in their course schedule. Un- 
like much of the previous research that lacked a developmental 
analysis, this study was conducted with a large sample of sixth- 
grade students to capture the social effects of academic teaming 
during the transition year into middle school, when reputations and 
social hierarchies are being formed. To allow adequate time for 
these social processes to develop, the influence of academic team- 
ing on the relation between social preference and peer victimiza- 
tion was examined in the spring of sixth grade (controlling for 
social preference and victimization in the fall). It was hypothesized 
that low social preference would be associated with greater vic- 
timization, especially for students who experienced greater aca- 
demic teaming, thus being repeatedly exposed to the same class- 
mates throughout the course of the school day. 


Method 


Participants 


Participants were drawn from a larger sample of 5,076 sixth- 
graders across two cohorts of students participating in the UCLA 
Middle School Diversity Project (MSDP), a longitudinal study of 
middle school adjustment in ethnically diverse schools from 
Northern and Southern California. Students were enrolled in one of 
20 schools that varied in ethnic composition. To reduce confounds 
of ethnic diversity with socioeconomic status (SES), schools at the 
extremes of the SES continuum were avoided; only schools within 
a 20-80% range of free and/or reduced-price meal (FRPM) eli- 
gibility were recruited for the study. At the time of this study, 
school records were available for 19 of the 20 schools. 

As is typical for sixth-grade students in California middle 
schools, all students were enrolled in a different subject with a 
different teacher each period and rotated classrooms throughout 
the school day. However, the extent to which students’ classmates 
differed or stayed the same from course to course varied by school. 
In many California middle schools, for example, the same group of 
classmates travels together from course to course, a practice re- 
ferred to here as academic teaming. Using students’ class sched- 
ules and the index of academic teaming described in detail below, 
participants were selected if they attended a school with significant 
within-school variability in academic teaming (i.e., some variation 
in classmates from course to course). Regardless of the extent of 
teaming practiced, however, students in these schools kept the same class 
schedule from fall to spring semester such that the classmates with 
whom they shared their courses remained the same throughout the 
academic year. 

In order to examine whether high or low teaming affects the
relation between social preference and victimization, a subset of
schools from the larger sample was selected in which there was
sufficient variance in the practice of teaming. Only five of the 19
schools for which class schedules were available met this criterion
(i.e., the proportion of classmates that remained the same across all 
academic subjects ranged, on average, from .21 to .65). These five 
schools did not differ significantly from the overall sample in 
terms of FRPM eligibility or overall Academic Performance Index
(API) scores as reported by the California Department of Education
(see Appendix). None of these schools housed special programs
or magnet (e.g., gifted/highly gifted, science) centers. For
two of the five schools that had substantial within-school variance 
in teaming scores, there was a significant correlation between 
academic teaming and classroom academic performance (r = .67 
and −.40, respectively), suggesting that some schools may use
teaming as a mechanism for ability grouping or academic tracking 
(e.g., grouping together remedial or honors students). The high 
(M = .92) average teaming scores for the remaining 14 schools in 
the larger sample demonstrate the prevalence of this middle school 
practice (see Appendix). 

The analytic sample for the current study comprised 1,044 
students (51.3% girls) from the 5 selected schools. The ethnic 
composition of the sample (based on student self-report) is as 
follows: 30.6% Latino/Mexican, 22.6% Asian (East/Southeast/ 
South), 12.5% White, 11.9% African American, 3.0% Filipino/ 
Pacific Islander, 14.2% multiethnic/biracial, and 5.2% other. 


Procedure 


Beginning in the fall of 2009, students with signed parental 
consent completed a questionnaire during a single period in one of 
their sixth-grade classes. Students recorded their answers indepen- 
dently as they followed instructions being read aloud by a graduate 
research assistant who reminded them of the confidentiality of 
their responses. A second researcher circulated around the class- 
room to help students as needed. This procedure was repeated 
(approximately 5 months later) in the spring semester of sixth 
grade. At both waves of data collection, students were given an 
honorarium of $5 for completing the questionnaire. 


Measures 


Social preference. Social preference among peers was deter- 
mined by peer nomination. In both the fall and spring of sixth 
grade, students were presented with a roster containing the names 
of all students in their grade level at their school, arranged by name 
(alphabetically by first name) and gender. Given the rotating 
structure of courses in California middle schools and the opportu- 
nity for interaction with many other grade mates throughout the 
school day, grade-level rosters were determined to be more appro- 
priate than classroom-level rosters, which would have been limited 
to one set of peers to whom students were exposed in a given 
school day. Using the roster, students were instructed to record the 
names of their classmates in response to the questions, “Which 
sixth-grade students from your list would you like to hang out with 
at school?” and “Which sixth-grade students from your list do you 
not like to hang out with at school?” Students were allowed to 
record as many names as they desired but were instructed not to 
nominate themselves. 

The conditional phrase “would you like to hang out with” was 
intentionally used as a measure of peer acceptance because it could 
include both whom students associated with already and whom 
they would like to hang out with if given the opportunity. Other 
peer acceptance measures commonly used in the literature (e.g., 
“who do you like the most at school?”) also capture both estab- 
lished and desired associations with peers (cf. Lease & Axelrod, 
2001; MacDonald & Cohen, 1995). 

Similar to the procedure used by Coie, Dodge, and Coppotelli
(1982), “not like” nominations received by each student were 
subtracted from “like” nominations received. Thus, social prefer- 
ence scores of 0 represented students who were equally liked and 
disliked, positive social preference scores (scores greater than 0) 
represented students who were liked more than they were disliked, 
and negative social preference scores (scores less than 0) repre- 
sented students who were disliked more than they were liked. 

Social impact. Social impact is often measured in conjunction 
with social preference in order to differentiate individuals with 
similar social preference scores who may be more or less known to 
members of the peer group (Coie et al., 1982). For example, an
individual who received five “like” nominations and five “not
like” nominations may be better known to peers than an
individual who received one “like” nomination and one “not like” 
nomination, even though both individuals would be given a social 
preference score of 0. In other words, social impact is useful in 
detecting the strength of one’s reputation (positive or negative). In 
order to control for the influence of reputation strength on peer 
victimization, social impact in the spring was calculated for each 
participant and used as a covariate in all analyses. 
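
The two nomination-based scores described above can be sketched as follows. This is a minimal illustration rather than the study's scoring procedure: the function name, variable names, and sample nominations are hypothetical, and computing social impact as the sum of "like" and "not like" nominations is an assumption based on the Coie et al. (1982) convention and the five-and-five example in the text.

```python
from collections import Counter

def score_nominations(like_noms, dislike_noms, roster):
    """Return {student: (social_preference, social_impact)}.

    like_noms / dislike_noms are lists of (nominator, nominee) pairs.
    Self-nominations are dropped, mirroring the instructions given to students.
    """
    likes = Counter(nominee for nominator, nominee in like_noms
                    if nominator != nominee)
    dislikes = Counter(nominee for nominator, nominee in dislike_noms
                       if nominator != nominee)
    scores = {}
    for student in roster:
        preference = likes[student] - dislikes[student]  # "like" minus "not like"
        impact = likes[student] + dislikes[student]      # assumed: total nominations received
        scores[student] = (preference, impact)
    return scores

# Hypothetical grade-level roster and nominations.
roster = ["Ana", "Ben", "Cy", "Dee"]
like_noms = [("Ana", "Ben"), ("Cy", "Ben"), ("Dee", "Ana"), ("Ben", "Ben")]
dislike_noms = [("Ana", "Cy"), ("Ben", "Cy"), ("Dee", "Ben")]
scores = score_nominations(like_noms, dislike_noms, roster)
```

A student with no nominations of either kind receives a preference score of 0 and an impact score of 0, which is how impact separates that student from one who is equally liked and disliked by many peers.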

Victimization. Because social preference is a reputational mea- 
sure of status among peers, peer reports of victimization may be more 
highly correlated with social preference than with self-reports of peer 
victimization. For this reason, both peer-reported and self-reported 
measures of victimization were used in this study. 

Peer-reported victimization. On the same peer nomination 
measure as described above, students were instructed to record the 
names of their classmates in response to the question, “Which 
sixth-grade students from your list get picked on by other kids (get 
hit or pushed around, called bad names, talked about behind their 
backs)?” The total number of “picked on” nominations that each 
student received was then tallied to create a score of peer-reported 
victimization in both fall and spring of sixth grade. 

Self-reported victimization. At each wave of data collection, 
students answered seven items about how often someone in their 
school had engaged in some type of aggression toward them (e.g., 
“hit, kicked, or pushed you,” “called you bad names”) since the 
beginning of the school year. Responses ranged from 1 (never) to 5 
(almost every day). This new measure, created for the larger study, 
has been shown to relate to other indicators of social and emotional 
adjustment (see Lanza, Echols, & Graham, 2013). On the basis of 
high internal consistency in both the fall and spring of sixth grade 
(α = .86 and .87, respectively), a mean of these items was computed
and used as a single score of self-reported victimization. 
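
As a rough sketch of the two computations mentioned here, the code below calculates Cronbach's alpha with the standard formula and then the per-student mean score. The ratings are hypothetical and are not the study's data; the function names are likewise illustrative.

```python
from statistics import mean, pvariance

def cronbach_alpha(responses):
    """responses: one list of item ratings per student (same item order)."""
    k = len(responses[0])
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    # alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def victimization_score(item_ratings):
    """Mean of one student's ratings (1 = never ... 5 = almost every day)."""
    return mean(item_ratings)

# Hypothetical ratings: four students, seven items.
ratings = [
    [1, 1, 2, 1, 1, 2, 1],
    [2, 3, 2, 2, 3, 2, 2],
    [4, 4, 5, 4, 3, 4, 4],
    [5, 4, 5, 5, 5, 4, 5],
]
alpha = cronbach_alpha(ratings)
student_scores = [victimization_score(row) for row in ratings]
```

Population variances are used for both the items and the totals; because alpha depends only on their ratio, using sample variances throughout would give the same value.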

Academic teaming. Students’ class schedules were used to 
measure the proportion of participants’ classmates who remained 
the same across all academic subjects during each semester. This 
proportion was calculated using an index of academic teaming that 
was created specifically for this study: 

Using the above formula, the proportion of classmates in each
academic class (C_x) who were also in another academic class (C_y)
was calculated for all possible academic course combinations in
each student’s class schedule. The sum of these proportions was
then divided by the total number of possible academic course
combinations (nP2) to create an average proportion of students in
each participant’s class schedule that remained the same through- 
out the academic subjects (i.e., math, science, English, social 
studies) in a given academic semester. The top half of the 
academic-teaming equation represents the overlap in classmates 
between two given academic courses (e.g., math and social stud- 
ies) totaled across all possible course combinations (i.e., math and 
social studies, math and science, social studies and science, etc.). 
The bottom half of the equation represents the number of possible 
academic course combinations when each course is paired with 
every other course. Possible scores on this teaming index range 
between 0 and 1, with scores closer to 1 representing a higher
proportion of students in one academic course who were also in 
every other academic course (i.e., complete academic teaming). 
For example, a score of .25 would indicate that 25% of a student’s 
classmates remained the same across all four academic courses in 
his or her class schedule (low teaming), while a score of .75 would 
indicate that 75% of a student’s classmates remained the same 
across all four academic courses (high teaming). 
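
A minimal sketch of the index as it is described in words above. It rests on assumptions that the text does not confirm: the proportion for a course pair is taken to be the number of shared classmates divided by the number of classmates in the first course, nP2 is read as the count of ordered course pairs, n(n − 1), and the focal student is left out of his or her own classmate sets. All names and rosters are hypothetical.

```python
from itertools import permutations

def teaming_index(student, rosters):
    """rosters: {course: set of student ids in the focal student's section}."""
    # Assumption: the focal student does not count as his or her own classmate.
    classmates = {course: members - {student} for course, members in rosters.items()}
    pairs = list(permutations(classmates, 2))  # ordered pairs, n * (n - 1) of them
    if not pairs:
        raise ValueError("at least two academic courses are required")
    total = 0.0
    for x, y in pairs:
        if classmates[x]:
            # Share of classmates in course x who are also classmates in course y.
            total += len(classmates[x] & classmates[y]) / len(classmates[x])
    return total / len(pairs)

# Hypothetical schedule: math and science share one roster,
# English and social studies share another, with two students in common.
rosters = {
    "math": {"s0", "a", "b", "c", "d"},
    "science": {"s0", "a", "b", "c", "d"},
    "english": {"s0", "a", "b", "e", "f"},
    "social_studies": {"s0", "a", "b", "e", "f"},
}
index = teaming_index("s0", rosters)
```

Under these assumptions, identical rosters in every course yield an index of 1 (complete academic teaming), and rosters with no overlap yield 0.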

Classroom academic performance. Classroom academic 
performance was measured by average academic GPA among 
classmates according to the following procedure. First, on the basis 
of students’ semester grades provided in school records, grade 
point average (GPA) was calculated for all participants for each 
academic course in their class schedule. Next, average GPA across 
classmates in each academic course was calculated. Finally, aver- 
age classmate GPA in each course was averaged across the four 
academic courses in each participant’s class schedule. Because the 
average academic performance to which students are exposed in 
their classrooms varies for students in middle school depending on 
their course schedules, each participant received an average class- 
mate GPA score, ranging from 0 to 4, using the available school 
records data for participants. 

Academic deviation. To control for risk of victimization as- 
sociated with deviation from the norm for academic performance 
in students’ academic courses, a difference score was calculated 
for each participant and used as a covariate in all analyses. To 
calculate this difference score, average classmate GPA was sub- 
tracted from average individual GPA for academic courses. Posi- 
tive deviation scores represented students who were performing 
better than their classmates and negative deviation scores repre- 
sented students who were performing worse than their classmates. 
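
The classroom academic performance and academic deviation scores can be sketched together, since the second is defined in terms of the first. This is an illustration under stated assumptions: grades are already expressed as grade points on a 0 to 4 scale, and the focal student's own grade is excluded from each classmate average (the text does not say whether it was). Names and grades are hypothetical.

```python
from statistics import mean

def classmate_gpa_and_deviation(student, course_grades):
    """course_grades: {course: {student_id: grade points}} for the focal
    student's academic sections. Returns (classmate GPA, deviation)."""
    course_means = []
    own_grades = []
    for grades in course_grades.values():
        # Assumption: the focal student's grade is left out of the classmate mean.
        others = [g for sid, g in grades.items() if sid != student]
        course_means.append(mean(others))
        own_grades.append(grades[student])
    classmate_gpa = mean(course_means)            # averaged across academic courses
    deviation = mean(own_grades) - classmate_gpa  # positive = outperforming classmates
    return classmate_gpa, deviation

# Hypothetical grades for a student enrolled in two academic courses.
course_grades = {
    "math": {"s0": 4, "a": 3, "b": 2},
    "science": {"s0": 3, "a": 3, "b": 3},
}
classmate_gpa, deviation = classmate_gpa_and_deviation("s0", course_grades)
```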


Planned Missing Design 


In the larger study from which this sample was drawn, a three- 
form planned missing design was implemented in order to increase 
the efficiency of collecting data from such a large number of 
participants (see Graham, Taylor, Olchowski, & Cumsille, 2006). 
With this design, participants were given one of three question- 
naires, each of which excluded a different set of measures, result- 
ing in missing data on these measures for one third of participants. 
Because “missingness” was planned (i.e., under the control of the 
researchers) and not a function of other measured or unmeasured 
variables, these missing data were assumed to be missing com- 
pletely at random (MCAR; see Little & Rubin, 1987). In the 
current study, only peer-reported victimization was part of the
planned missing design. There was a minimal amount (<4%) of 
unplanned missing data for this measure and all other measures in 
this study. Missing data were handled using full information max- 
imum likelihood (FIML; described below). 


Results 


Descriptive statistics for all study variables are shown in Table 
1. Correlations among study variables are shown in Table 2. Given 
the potential for reciprocal relations between social preference and 
victimization over time, a path (e.g., cross-lagged) model based on 
a structural equation modeling framework in Mplus (Muthén & 
Muthén, 2012) was used in order to measure the influence of social 
preference in the spring of sixth grade on victimization in the 
spring of sixth grade while simultaneously accounting for the 
influence of social preference and victimization in the fall of sixth 
grade. Two sets of models were estimated: one for peer-reported 
victimization and one for self-reported victimization. In each 
model, FIML was used in order to make use of all available data 
from participants. FIML is considered the most appropriate esti- 
mation technique for structural equation models when missing data 
are MCAR (Enders & Bandalos, 2001). 

The results of the path models (described separately for peer- 
and self-reported victimization below) are shown in Table 3. 
Social preference and victimization in the fall and spring of sixth 
grade were modeled as observed variables. Individual path coef- 
ficients from covariates (gender, social impact, academic devia- 
tion) to observed variables are not shown, but all covariates were 
correlated with social preference and victimization in both fall and 
spring. In Step 1, academic teaming was included as a moderator 
of the relation between social preference and victimization in the 
spring of sixth grade, controlling for classroom academic perfor- 
mance. In Step 2, classroom academic performance was included 
in a three-way interaction term with academic teaming and social 
preference. As is standard practice when modeling higher order 
interaction terms, all lower order interaction terms (Social Preference
× Academic Teaming, Social Preference × Classroom Academic
Performance, Academic Teaming × Classroom Academic
Performance) were included in Step 2. R² values are reported for
each model in order to evaluate the proportion of variance ac- 
counted for at each step. To ensure adequate sample size to detect 
close model fit, a separate power analysis for each model was
conducted following procedures outlined by MacCallum, Browne,
and Sugawara (1996) for structural equation models (see Table 4). 
The power analyses confirmed sufficient sample size using even 
the most conservative criteria (power of .80 at α = .01).
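
To make the Step 2 specification concrete, the sketch below builds the three-way product term together with every lower order product. It only constructs predictor columns and does not estimate the path model, which was fit in Mplus. Mean-centering before multiplying is an assumption (a common practice that the text does not state), and the column names and values are hypothetical.

```python
from statistics import mean

def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]

def step2_terms(preference, teaming, performance):
    """Predictor columns for the three-way moderation step."""
    sp, at, cap = center(preference), center(teaming), center(performance)
    return {
        "sp": sp,
        "at": at,
        "cap": cap,
        # All two-way products must accompany the three-way product.
        "sp_x_at": [a * b for a, b in zip(sp, at)],
        "sp_x_cap": [a * c for a, c in zip(sp, cap)],
        "at_x_cap": [b * c for b, c in zip(at, cap)],
        "sp_x_at_x_cap": [a * b * c for a, b, c in zip(sp, at, cap)],
    }

# Hypothetical values for three students.
terms = step2_terms(
    preference=[-2, 0, 2],
    teaming=[0.25, 0.5, 0.75],
    performance=[2.0, 3.0, 4.0],
)
```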


Peer-Reported Victimization 


Social preference in the spring of sixth grade had a significant 
impact on peer-reported victimization in the spring of sixth grade, 
even after controlling for all relations between social preference 
and peer-reported victimization in the fall of sixth grade and 
between fall and spring of sixth grade. As shown in Step 1 of Table
3, the negative coefficient for this pathway indicates that as spring 
social preference increased, spring peer-reported victimization de- 
creased; conversely, as spring social preference decreased, spring 
peer-reported victimization increased. 

There was no main effect of academic teaming on peer-reported 
victimization in the spring. However, there was a significant 
interaction effect with social preference, indicating that the relation 
between social preference and peer-reported victimization in the 
spring was magnified by academic teaming. In other words, the 
more teaming students experienced, the greater the impact of 
social preference on peer-reported victimization. As shown in 
Figure 1, for students with high social preference, high teaming 
resulted in lower peer-reported victimization. For students with 
low social preference, however, high teaming resulted in particu- 
larly high peer-reported victimization. Thus, as hypothesized, ac- 
ademic teaming appears to be a risk factor for low-preference 
students at school. 
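
To make the moderation concrete, the slope of spring social preference on peer-reported victimization can be computed at the teaming levels plotted in Figure 1, using the Step 1 estimates reported in Table 3 (-.136 for spring social preference and -.240 for the interaction). The sketch below is illustrative only. Whether teaming was mean-centered before the product term was formed is not stated in this section, so the `center` argument is an assumption, shown both as 0 (uncentered) and as the sample mean of .36 from Table 1. The function and variable names are hypothetical.

```python
# Simple-slope sketch for the Social Preference x Academic Teaming interaction.
# Coefficients: Step 1, peer-reported victimization, as reported in Table 3.
B_SOCIAL_PREFERENCE = -0.136   # spring social preference
B_INTERACTION = -0.240         # spring social preference x academic teaming

# Teaming levels taken from the Figure 1 caption (proportion of shared classmates).
TEAMING_LEVELS = {"low": 0.25, "moderate": 0.50, "high": 0.75}


def simple_slope(teaming: float, center: float = 0.0) -> float:
    """Slope of social preference on victimization at a given teaming level.

    ASSUMPTION: `center` is the value subtracted from teaming before the
    product term was built; the article section does not state it.
    """
    return B_SOCIAL_PREFERENCE + B_INTERACTION * (teaming - center)


# Uncentered reading (center = 0) and mean-centered reading (center = .36).
slopes_uncentered = {k: simple_slope(t) for k, t in TEAMING_LEVELS.items()}
slopes_centered = {k: simple_slope(t, center=0.36) for k, t in TEAMING_LEVELS.items()}

if __name__ == "__main__":
    for label in TEAMING_LEVELS:
        print(
            f"{label:>8} teaming: "
            f"uncentered slope = {slopes_uncentered[label]:.3f}, "
            f"centered slope = {slopes_centered[label]:.3f}"
        )
```

Under either reading the slope becomes more negative as teaming increases, which is the pattern described above: social preference matters more for victimization when students share more of their classmates across the day.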

As shown in Step 2 of Table 3, there was also a significant 
three-way interaction between social preference, academic teaming,
and classroom academic performance; and the higher R² value
indicates more variance in peer-reported victimization accounted 
for with this model. The three-way interaction is depicted in Figure 
2. Each plotted slope shows the relation between social preference 
and peer-reported victimization at varying levels of academic 
teaming and classroom academic performance. Higher teaming 
resulted in greater victimization for children with low social pref- 
erence in both higher and lower performing classrooms. In addi- 
tion, children with low social preference had greater peer-reported 
victimization in higher compared with lower performing classes 
regardless of the amount of academic teaming they experienced. 








Table 1
Descriptive Statistics for Study Variables

Variable                                            Min           Max     M      SD
Social preference fall 6th grade                    -19.00        20.00   2.50   3.35
Social preference spring 6th grade                  -18.00         9.00   0.37   2.54
Social impact fall 6th grade                          0.00        26.00   4.49   3.70
Social impact spring 6th grade                        0.00        20.00   2.80   2.65
Peer-reported victimization fall 6th grade            0.00         9.00   0.40   0.89
Peer-reported victimization spring 6th grade          0.00        20.00   0.59   1.65
Self-reported victimization fall 6th grade            1.00         5.00   1.60   0.66
Self-reported victimization spring 6th grade          1.00         4.86   1.78   0.74
Academic teaming spring 6th grade                     0.05         0.99   0.36   0.22
Classroom academic performance spring 6th grade     [illegible]    3.76   2.86   0.38
Academic deviation spring 6th grade                 [illegible]    2.19   0.08   0.79

Note. Min = minimum; Max = maximum; M = mean; SD = standard deviation.
Table 2
Correlation Among Study Variables

Fall peer- Fall self- Spring self- Classroom 
; Fall social Spring social reported Spring peer- reported reported Academic academic Academic Social 
Variable pref. pref. vic. reported vic. vic. vic. teaming performance deviation impact 
Fall social pref. — 546"™ aa aa en ODD .016 — .038 ply, Siem 2155 ape Oa 
Spring social pref. = = 302) 30s 008s peel Soaeee 053 .148"™" AGmas see li Dra 
Fall peer-reported vic. _ (622-5 ALS 1e sate .031 .009 O12 12037"% 
Spring peer-reported vic. — eile Oleg .208"** .032 .025 —.075* E200) 
Fall self-reported vic. — 939 ¢ .064* SANTA Sly .106™ 
Spring self-reported vic. — .070 .002 =n Oe rl Saae 
Academic teaming “= SOO mae 012 — O00. 
Classroom academic 
performance — .079* .093"™* 
Academic deviation — —.099™" 


Social impact 


Note. pref. = preference. vic. = victimhood. Academic teaming, classroom academic performance, academic deviation, and social impact were measured only in spring.
*p < .05. **p < .01. ***p < .001.


Peer-reported victimization was greatest for children with low 
social preference in highly teamed, high performing classrooms. 


Self-Reported Victimization 


Similar to the model for peer-reported victimization, social 
preference in the spring of sixth grade had a significant impact 
(although smaller in magnitude) on self-reported victimization in 
the spring of sixth grade, even after controlling for all relations 
between social preference and self-reported victimization in the 
fall of sixth grade and between fall and spring of sixth grade. 
Again, the negative coefficient for this pathway indicates that as 
spring social preference increased, spring self-reported victimization
decreased, and conversely that as spring social preference
decreased, spring self-reported victimization increased. 


In Step 1 of this model there was a significant main effect of
academic teaming on self-reported victimization in the spring,
indicating that students who experienced greater teaming reported
more victimization. As shown in Figure 3, there was also a
significant interaction effect with social preference such that for
students with low social preference, greater teaming resulted in
particularly high self-reported victimization. Similar to the model
for peer-reported victimization, the relation between social preference
and self-reported victimization was weakest when teaming
was low. Unlike the model for peer-reported victimization, however,
the three-way interaction between social preference, academic
teaming, and classroom academic performance was not
significant (see Step 2 of Table 3), indicating that the influence of
academic teaming on the relation between social preference and
self-reported victimization was the same regardless of the level of
academic performance in children’s classes.

Table 3
Results of Path Models Testing the Moderating Role of Academic Teaming and Classroom Academic Performance on the Relation
Between Social Preference and Victimization

                                                            Peer-reported victimization        Self-reported victimization
                                                            Step 1           Step 2            Step 1           Step 2
Variable                                                    Est. (SE)        Est. (SE)         Est. (SE)        Est. (SE)
Intercept (spring victimization)                            .183 (.071)*     [illegible]       .688 (.074)***   .662 (.077)***
Female                                                      -.162 (.081)*    -.130 (.079)      -.022 (.050)     -.016 (.050)
Social impact                                               .082 (.016)***   .086 (.016)***    .033 (.010)**    .034 (.010)***
Academic deviation                                          .019 (.052)      .009 (.051)       -.042 (.032)     -.048 (.032)
Fall victimization                                          .987 (.047)***   .946 (.046)***    .606 (.037)***   .613 (.037)***
Fall social preference                                      -.027 (.015)     -.020 (.015)      .017 (.009)      .019 (.009)*
Spring social preference                                    -.136 (.020)***  -.142 (.020)***   -.025 (.012)*    -.026 (.013)*
Academic teaming                                            .275 (.182)      .627 (.199)**     .336 (.115)**    [illegible]
Classroom academic performance                              .259 (.109)*     .214 (.116)       .208 (.068)**    .243 (.073)**
Spring Social Preference × Academic Teaming                 -.240 (.084)**   -.385 (.085)***   -.145 (.051)**   -.178 (.057)**
Spring Social Preference × Classroom Academic Performance                    -.164 (.045)***                    -.038 (.029)
Academic Teaming × Classroom Academic Performance                            1.386 (.478)**                     -.219 (.302)
Spring Social Preference × Academic Teaming × Classroom
  Academic Performance                                                       -.661 (.196)**                     [illegible]
R²                                                          [three values legible: .488, .379, .379; column assignment unclear]

Note. Est. = estimate. SE = standard error.
*p < .05. **p < .01. ***p < .001.

Table 4
N for Test of Close Fit at Power = .80 for α = .01, .05, and .10 With Varying Degrees of Freedom

Model             α = .01    α = .05    α = .10
Step 1 (11 df)    834.38     612.50     504.69
Step 2 (32 df)    420.31     315.63     264.06

Note. df = degrees of freedom. Greater degrees of freedom reduce
required sample size (see MacCallum, Browne, & Sugawara, 1996).

To summarize these results, for both peer- and self-reported 
victimization, as social preference increased, victimization de- 
creased. Likewise, as social preference decreased, victimization 
increased. Predictably, because of informant overlap, this effect
appeared to be stronger for peer-reported than self-reported vic- 
timization. The interaction between social preference and aca- 
demic teaming was significant for both types of victimization, and 
the relation between low social preference and victimization was 
greater when teaming was high. For peer-reported victimization, 
the relation between low social preference and victimization was 
greatest when both academic teaming and classroom academic 
performance were high. 


Discussion 


In early adolescence, perhaps more so than in any other time in 
development, status among peers contributes largely to children’s 
social and emotional well-being and their overall adjustment in 
school (Wentzel, 2003). With many of these children using ag- 
gression to gain status (Pellegrini, 2002; Pellegrini & Long, 2002), 
having low status makes some children particularly vulnerable to
peer victimization (Sandstrom & Cillessen, 2003). Certain educational
practices determine the type and amount of exposure children
have to others in the peer group at school, which may affect
their visibility as either high- or low-status members, further 
influencing their likelihood of being victimized. In particular, 
academic teaming influences the extent to which classmates re- 
main the same from class to class throughout the school day, which 
could make social status more or less salient to their peers. In this 
study, social preference was used as the measure of status among 
peers, and the moderating role of academic teaming on the asso- 
ciation between social preference and peer victimization for chil- 
dren in the sixth grade was examined. 

Consistent with past research, the results indicated a significant 
negative relation between social preference and peer victimization. 
Even after accounting for reciprocal relations among these vari- 
ables over time, lower social preference scores were associated 
with greater peer victimization in the spring of sixth grade according
to both peer- and self-report. Academic teaming also had a
significant adverse main effect on victimization, but only for
self-reported victimization, suggesting that the more children shared
their classes with the same classmates, the more they perceived 
being victimized. For both peer- and self-reported victimization, 
academic teaming moderated the relation between social prefer- 
ence and victimization, increasing the risk of victimization among 
low preference peers. That is, regardless of the victimization 
measure that was used, low-preference children in highly teamed 
classes were more victimized than low-preference children who 
shared fewer classmates throughout the school day. 

There was an opposite effect of academic teaming on peer- 
reported victimization for children with high social preference. 
Although high-preference children were at decreased risk of peer 
victimization overall, this was especially true for high-preference 
children who experienced greater academic teaming. These results
support the hypothesis that academic teaming may increase the
social visibility of children to their peers, which may be a promotive
factor for children who enjoy high social preference in the
peer group but a risk factor for low-preference children who rarely
get the chance during the school day to escape their reputation.

Figure 1. The moderating role of academic teaming on the relation between social preference and peer-reported victimization in middle school. Low teaming = 25% shared classmates, moderate teaming = 50% shared classmates, high teaming = 75% shared classmates. SD = standard deviation.

Figure 2. The moderating role of academic teaming and classroom academic performance on the relation between social preference and peer-reported victimization in middle school. Low teaming = 25% shared classmates, high teaming = 75% shared classmates. Low classroom academic performance = 1 SD below mean, high classroom academic performance = 1 SD above mean. SD = standard deviation.

Figure 3. The moderating role of academic teaming on the relation between social preference and self-reported victimization in middle school. Low teaming = 25% shared classmates, moderate teaming = 50% shared classmates, high teaming = 75% shared classmates. SD = standard deviation.

The influence of academic teaming on the relation between
social preference and peer victimization was further moderated by
classroom academic performance, such that children with low
social preference were at greatest risk for victimization when they
experienced higher levels of academic teaming and were taking
classes with higher performing classmates. This effect was only
observed for peer-reported victimization, suggesting that children
with low social preference may not actually experience more
victimization in higher performing classrooms but may stand out
more as victims compared with their higher preference peers.
Because higher performing classrooms may be composed of more 
children with high social preference compared with average or 
lower performing classrooms (see Meijs et al., 2010; Newcomb, 
Bukowski, & Pattee, 1993), children who deviate from the norm of 
high social preference may be more visible to the peer group and 
therefore more easily identified as victims, especially if they are 
taking classes with the same classmates throughout the school day. 
Because social reputations are often difficult to change, especially 
in contexts in which peers remain the same (Brown, 1996), low- 
preference children in highly teamed, higher performing class- 
rooms may be at risk for chronic victimization and a host of 
adjustment difficulties that could follow. Future research should 
consider the role of academic teaming on the unique social trajec- 
tories of children in higher performing classrooms who are at 
opposite ends of the social status hierarchy. 


Strengths and Limitations 


This study makes important contributions to the literature on 
both peer victimization and interdisciplinary teaming. By compar- 
ing the influence of social preference across both peer- and self- 
reported victimization, this study demonstrates that the relation 
between social preference and peer victimization may be evident 
regardless of how victimization is measured. In addition, in this 
study it was suggested that social visibility is the mechanism 
through which academic teaming influences the relation between 
low social preference and peer victimization, especially in higher 
performing classrooms. Although social visibility was not directly 
measured, it was assumed that repeated exposure to the same 
classmates through the practice of academic teaming would in- 
crease visibility among peers. This study introduces a novel meth- 
odological tool for measuring exposure in the peer group, but 
future research should consider other approaches to measuring 
social visibility (e.g., being a member of a particular “crowd”). 

Most notably, this is the first study in which interdisciplinary 
teaming has been distinguished from academic teaming in order to 
measure the extent to which children shared their classes with the 
same classmates throughout the school day. Because measuring 
academic teaming at the individual level and as a continuous 
variable is the only way to investigate individual outcomes asso- 
ciated with teaming, this study provides an important first look at 
the negative social consequences that may result from this com- 
mon educational practice. Given the heightened risk of poor ad- 
justment in middle school for children who are victimized, the 
prevalence of this school practice is alarming. However, it should 
be noted that children may share their classes with many of the 
same classmates even when interdisciplinary teaming as an in- 
structional practice is not being utilized. Thus, the practice of 
academic teaming may or may not be synonymous with the prac- 
tice of interdisciplinary teaming in all schools. As such, there may 
not be social risk associated with interdisciplinary teaming per se, 
but, rather, the risk may only reside in repeated exposure to 
classmates. Future research should further differentiate between 
the practices of interdisciplinary and academic teaming and con- 
sider other individual and social risk factors that may make chil- 
dren more or less likely to benefit from teaming in all its various 
forms. 


Future Directions 


This study sets the stage for other important research on peer 
victimization and the measurement of classroom context in middle 
and high school education. In the present study, the overlap in 
peer- and self-reported victimization was not accounted for (i.e.,
self-reported victimization was not included as a covariate in the 
model for peer-reported victimization and vice versa). However, it 
may be that children with high victimization scores on one mea- 
sure were not necessarily the same children with high victimiza- 
tion scores on the other measure. If there are indeed different 
subgroups of victims in middle school, as suggested by Sandstrom
and Cillessen (2003), it is possible that the same feature of the 
classroom context might affect members of these subgroups in 
different ways. For example, children who perceive victimization 
but are not identified as victims by their peers may suffer from a 
victim mentality that is neither explained nor influenced by their 
classroom context, whereas children who are identified as victims 
by their peers and who themselves report being victimized may be
particularly vulnerable when their classmates stay the same from 
course to course throughout the school day. Although this “comor- 
bidity” effect was not examined in the present study, future re- 
search might consider whether using these multiple informants has 
implications for assessing the risks associated with peer- versus 
self-perceived victimization in certain classroom contexts. 

In this study, students’ class schedules were used to create 
individualized measures of classroom characteristics (e.g., class- 
room academic performance). This appears to be a promising new 
approach to measuring classroom context for students in middle- 
and secondary education settings that has some noteworthy advan- 
tages. First, this method makes it possible to detect differences in 
the influence of classroom context at various levels of measure- 
ment: between classrooms and schools, between students depend- 
ing on their course schedules, and even between courses taken by 
the same student. Next, this method may remove the need for 
multilevel modeling if nearly all the variance in classroom context 
resides between students (within schools) and not between class- 
rooms or schools. When multilevel modeling is necessary, class- 
room context measured at the individual level may substantially 
increase the number of Level 2 units (e.g., if classroom character- 
istics across courses are nested within individuals nested within 
schools). Most important, with this method, the individual expe- 
riences of children in middle and high school can be understood in 
ways never before examined. Instead of relying on measures of 
context specific to one classroom or school, this method makes it 
possible to investigate context across classrooms specific to each 
child. In other words, the entire school day as experienced by 
individual children as they travel from class to class can now be 
observed. Although the primary contribution of this study is the 
substantive understanding of the role of academic teaming in 
schoolchildren’s social adjustment in school, it is the hope that this 
novel approach to measuring classroom context will also make a 
significant methodological contribution to the literature. 
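
The schedule-based approach described above can be illustrated with a small sketch. It shows one way individualized classroom measures might be derived from class schedules: the mean performance of a student's classmates averaged over that student's sections, and a shared-classmate proportion as a stand-in for academic teaming. Everything here is hypothetical: the toy rosters and GPAs are invented, the function names are illustrative, and the exact operationalizations used in the study are not given in this section, so these formulas should be read as assumptions rather than the authors' measures.

```python
from statistics import mean

# Hypothetical toy data, invented purely for illustration.
schedules = {
    "ana": ["math-1", "eng-1", "sci-1"],
    "ben": ["math-1", "eng-1", "sci-2"],
    "cal": ["math-1", "eng-2", "sci-2"],
    "dee": ["math-2", "eng-2", "sci-2"],
}
gpa = {"ana": 3.5, "ben": 2.5, "cal": 3.0, "dee": 2.0}


def rosters(schedules):
    """Invert student -> sections into section -> set of students."""
    by_section = {}
    for student, sections in schedules.items():
        for section in sections:
            by_section.setdefault(section, set()).add(student)
    return by_section


def classroom_performance(student, schedules, gpa):
    """Mean classmate GPA per section, averaged over the student's sections.

    Sections in which the student has no classmates are skipped.
    """
    by_section = rosters(schedules)
    per_section = []
    for section in schedules[student]:
        mates = by_section[section] - {student}
        if mates:
            per_section.append(mean(gpa[m] for m in mates))
    return mean(per_section) if per_section else None


def shared_classmate_proportion(student, schedules):
    """ASSUMED teaming proxy: for each classmate, the fraction of the
    student's sections they share, averaged over all classmates."""
    by_section = rosters(schedules)
    sections = schedules[student]
    counts = {}
    for section in sections:
        for mate in by_section[section] - {student}:
            counts[mate] = counts.get(mate, 0) + 1
    if not counts:
        return None
    return mean(c / len(sections) for c in counts.values())


if __name__ == "__main__":
    for s in schedules:
        print(s, classroom_performance(s, schedules, gpa),
              shared_classmate_proportion(s, schedules))
```

Because both measures are computed per student from that student's own schedule, two students in the same school can receive different values, which is the property the paragraph above emphasizes.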


Implications for Practice 


Although interdisciplinary teaming may lead to some positive 
outcomes such as greater feelings of belonging in school, academic 
teaming may come with certain social costs that outweigh these 
benefits. During a time when status among peers is critical to 
overall adjustment and the risk of being victimized is so high,
researchers and practitioners would do well to consider the extent 
to which this practice should be used, particularly for vulnerable 
children in the peer group. For example, a less-restrictive teaming 
structure (e.g., large-enough teams, so that students are not re- 
quired to share all their academic courses with the same class- 
mates) might provide the academic benefits of this practice while 
avoiding the social costs. 

A relatively simple intervention strategy for reducing victimiza- 
tion among children with low social preference in the peer group 
would involve scheduling their courses in a way that would pro- 
vide them with maximum exposure to a diverse set of peers. 
Because school counselors are often responsible for course sched- 
uling, one important topic for future research is the role that school 
counselors play in addressing the academic and social needs of 
their students through course scheduling practices. 

When a teaming structure is imposed by the school or district, it 
might also be important to consider how teachers in their individ- 
ual classrooms might organize instruction in order to minimize the 
negative impact of this practice on students with low social pref- 
erence among their peers. For example, are students further clus- 
tered together (e.g., in the case of small group instruction) with a 
particular set of classmates, or do they have the opportunity to 
interact with a variety of students in class? Is seating by student 
choice, or do teachers implement seating charts? As both these 
factors could influence the extent of exposure of low-preference 
students to the same classmates, these are important topics for 
future research. 

Students’ social and academic lives are interrelated and are 
closely tied to their overall adjustment in school; it is therefore 
important to consider both the academic and social ramifications 
of any instructional practice. Until now, only the academic benefits 
of teaming have been considered. However, because interdisciplin- 
ary teaming, in general, and academic teaming, in particular, have 
a direct impact on the type and extent of social contact that 
children experience, the practice of teaming may be especially 
relevant to children’s social adjustment in school. It is the hope 
that the findings reported here will stimulate other research on the 
benefits and risks associated with the practice of teaming for 
children in middle school and that the methodology used here will 
make it possible to examine whether such outcomes apply to all, or 
just some, children in the classroom. 
Appendix
School Characteristics of MSDP Schools

School   FRPM   API           Academic Teaming
1        0.42   807           0.21
2        0.31   832           0.31
3        0.80   850           0.39
4        0.68   704           0.43
5        0.54   708           0.65
6        0.56   825           0.92
7        0.29   846           0.93
8        0.72   650           0.93
9        0.29   836           0.94
10       0.67   810           0.94
11       0.68   658           0.95
12       0.50   755           0.95
13       0.57   [illegible]   0.95
14       0.77   806           0.96
15       0.38   838           0.96
16       0.21   889           0.97
17       0.72   681           0.99
18       0.45   831           0.99
19       0.43   839           1.00

Note. MSDP = Middle School Diversity Project; FRPM = free and reduced-price meal eligibility; API = Academic
Performance Index. Academic teaming scores were based on average academic teaming experienced by participants in the
same school. Schools 1-5 were used in the analyses reported in this study.
Using nationally representative data from the Longitudinal Study of Australian Children (LSAC; N = 
5,107), this study assessed prospective connections between children’s early education and care (EEC) 
experiences from infancy through preschool and their cognitive and behavioral functioning in 1st grade. 
Incorporating 6 waves of data, analyses found that greater duration and intensity of exposure to center 
EEC settings predicted heightened fluid intelligence but also decreased behavioral functioning across 
multiple realms and reporters. Assessment of the timing of exposure found that the combination of 
infant/toddler and preschool center EEC, rather than only preschool EEC, drove these patterns. Results 
largely replicate patterns from U.S. studies, suggesting the importance of identifying EEC programs and 
models that can support children’s behavioral as well as cognitive skills. In contrast to U.S. results, 
associations between center EEC and children’s later functioning did not extend to basic academic skills 
and were not moderated by family socioeconomic resources or child temperament. 


Keywords: child care, early childhood education, school readiness, international comparison, propensity score weighting


Early education and care (EEC) programs serve diverse needs 
for children and families, from promoting children’s cognitive and 
behavioral skills, to supporting parental employment, to promoting 
equality and cultural norms. As such, governments across many 
countries are directing resources toward accessible, affordable, and 
high quality early education and care programs, using diverse 
policy levers such as quality regulations, federal or state subsidies, 
and family tax breaks (Organization for Economic Cooperation 
and Development [OECD], 2006). An extensive body of research 
has assessed how attending EEC programs affects children’s core 
cognitive and behavioral functioning. This research has consis- 
tently found that children who attend early education programs, 
particularly center-based programs in the year or two prior to 
kindergarten, show enhanced growth in cognitive skills in comparison
to their peers. At the same time, research also has suggested
that center EEC, particularly when full-time and begun
early in life, may be detrimental for later behavioral functioning 
(e.g., Coley, Votruba-Drzal, Miller, & Koury, 2013; Magnuson, 
Ruhm, & Waldfogel, 2007; NICHD Early Child Care Research 
Network [ECCRN], 2003; Phillips, McCartney, & Sussman, 
2006). The vast majority of the literature on EEC derives from 
U.S. studies. Although findings have been quite robust across 
numerous longitudinal data sets of children from the United States, 
there has been little replication across other countries (although 
see, e.g., Côté, Borge, Geoffroy, Rutter, & Tremblay, 2008; Geoffroy
et al., 2010). As such, we have limited knowledge concerning
other policy models of EEC and whether effects of EEC on 
children’s cognitive and behavioral skills are generalizable to 
different populations and in diverse policy and cultural environ- 
ments. 

With a relatively similar economic and cultural context yet 
differences in EEC policy and use, Australia offers an interesting 
context to replicate this research. In this article we assessed links 
between the duration, intensity, and timing of center-based EEC 
from infancy through preschool and children’s cognitive and be- 
havioral skills in first grade in a nationally representative sample 
of Australian children. We further addressed whether such asso- 
ciations differed across children from more or less advantaged 
home environments, considering family income, parent educa- 
tional attainment, and enriching home environments, and across 
children with easier versus more difficult temperaments. This work 
presents one of the only assessments of the long-term implications 
of EEC in Australia. Further, it seeks to expand the broader 
literature on EEC and children’s functioning by contrasting the 
roles of the duration, intensity, and timing of exposure to center- 
based EEC programs using rigorous statistical models to help 
adjust for selection bias and unmeasured heterogeneity. 
We base our analyses on theories of child development that
point to the importance of responsive, stimulating caregiving and 
children’s responses to environmental stressors in the first years of 
life. During infancy and early childhood, as children’s cognitive, 
social, and emotional skills develop rapidly, supportive interac- 
tions, responsive caregiving, and safe, stimulating learning envi- 
ronments are especially important (Early & Burchinal, 2001). For 
infants in particular, a consistent and responsive child-caregiver 
relationship is essential to promote secure child-caregiver attach- 
ments while providing opportunities for children to safely explore 
their environment. Thus, nonparental care may be less supportive 
for infants’ development, most notably when the care is provided 
with larger groups of children or less one-on-one child-adult
interaction, as may be found in center programs (Dowsett, Huston, 
Imes, & Gennetian, 2008). Further, center EEC programs may 
expose young children to greater physiological stress, as recent 
research indicates that experiences in center EEC put infants and 
toddlers at greater risks than older children for cortisol elevations 
throughout the school day, which in turn may impede children’s
early emotional and cognitive development (Dettling, Gunnar, & 
Donzella, 1999; Vermeer & van IJzendoorn, 2006; Watamura, 
Donzella, Alwin, & Gunnar, 2003). Preschool-aged children, in 
contrast, have developed enhanced language skills, emotional reg- 
ulation, and social skills in comparison to their younger counter- 
parts, and thus are likely to experience less stress from center- 
based EEC (Dettling et al., 1999; Vermeer & van IJzendoorn, 
2006; Watamura et al., 2003). Center-based preschools may better 
support preschool-aged children’s cognitive skills, as there are 
more opportunities to experience structured and diverse educa- 
tional curricula in centers than in parent care or home-based EEC 
(Coley, Li-Grining, & Chase-Lansdale, 2006; Dowsett et al., 2008; 
Fuller, Kagan, Loeb, & Chang, 2004; Maccoby & Lewis, 2003).

There are also theoretical perspectives to suggest that EEC 
experiences may be differentially influential across subgroups of 
children (Bradley, McKelvey, & Whiteside-Mansell, 2011). Chil- 
dren from low-resource home environments due to limited eco- 
nomic resources or low parental education typically experience 
less enriching, stimulating, warm, and consistent home environ- 
ments than their counterparts in economically advantaged families 
(Magnuson & Votruba-Drzal, 2009). Compensatory models of 
EEC argue that EEC programs may provide a particularly impor- 
tant resource to bolster the early academic and behavioral skills of 
children from homes with limited socioeconomic resources (Loeb, 
Bridges, Bassok, Fuller, & Rumberger, 2007; Loeb, Fuller, Kagan, 
& Carrol, 2004; McCartney, Dearing, Taylor, & Bub, 2007; 
Votruba-Drzal, Coley, Koury, & Miller, 2013). Theories of differ- 
ential susceptibility, in contrast, argue that EEC may be more 
influential for children with more difficult and challenging tem- 
peraments, although evidence supporting this theory are rather 
sparse and focus on EEC quality rather than type and quantity and 
on behavioral but not cognitive arenas of child functioning (Belsky 
& Pluess, 2011; Pluess & Belsky, 2010). 


Empirical Review of Early Childhood Education and 
Children’s School Readiness Skills 


Understanding the repercussions of EEC programs for chil- 
dren’s core cognitive and behavioral skills is essential, because 
such skills are key predictors of successful transitions into formal 
schooling and long-term educational success (Entwisle & Alexan- 
der, 1993; Li-Grining, Votruba-Drzal, Maldonado-Carreño, & 
Haas, 2010). Below we briefly review the empirical evidence, 
drawn primarily from studies of American children. 

Numerous studies have found that children who attend EEC, 
particularly center-based preschool programs, show enhanced 
early reading and numeracy skills in comparison to their peers in 
parent care or more informal home-based care settings, differences 
that extend into elementary school (Gormley, Gayer, Phillips, & 
Dawson, 2005; Loeb et al., 2007; Magnuson, Meyers, Ruhm, & 
Waldfogel, 2004; Morrissey, 2010; NICHD ECCRN, 2002, 2005; 
Duncan & NICHD ECCRN, 2003). One of the primary limitations 
of much of this work is the focus solely on preschool care. In 
nonexperimental studies that limit attention to children’s EEC 
experiences in the year or two prior to formal school, models fail 
to adequately delineate whether associations between center-based 
preschool and children’s enhanced cognitive skills in elementary 
school are driven by preschool experiences or rather by correlated 
experiences earlier in childhood. These models also fail to demar- 
cate whether earlier EEC experiences show unique links with 
children’s later cognitive skills. 

A handful of studies from the United States suggest that center- 
based EEC during infancy and the early toddler years may be less 
beneficial for children’s later cognitive skills than center EEC 
during the late toddler and preschool years (Loeb et al., 2007; 
Duncan & NICHD ECCRN, 2003; Votruba-Drzal et al., 2013). For 
example, Loeb et al.’s (2007) study using retrospective reports of 
the timing of center EEC initiation indicated starting center EEC 
between 2 and 3 years of age was associated with greater kinder- 
garten cognitive skills than starting EEC earlier or later. Duncan 
and NICHD ECCRN (2003) found similar results using prospec- 
tive data, arguing that greater exposure to center care between 2.5 
and 4.5 years had the strongest associations with children’s cog- 
nitive skills in kindergarten, with no benefits of earlier center EEC. 
One interpretation of these results rests on developmental timing, 
suggesting that center-based EEC is most cognitively supportive 
for children over age 2. A second interpretation is a duration of 
exposure argument, suggesting that 2 to 3 years of center care is 
more supportive of children’s cognitive skill development than 
more or fewer years. A third argument concerns the intensity of 
exposure, suggesting that moderate rather than limited or extensive 
exposure to EEC is most beneficial (Votruba-Drzal et al., 2013). 
Few studies have tried to delineate the relative merits of these 
perspectives. 

Only one published article of which we are aware has as- 
sessed such associations using nationally representative data 
from Australia (Coley, Lombardi, Sims, & Votruba-Drzal, 
2013). Following children from infancy through the transition 
to elementary school, this article considered children’s center 
EEC exposure at 9 months, 2 years, and 4 years and their 
cognitive, language, and reasoning skills at age 7. In contrast to 
a large body of research on American children, this study found 
no benefits of center-based preschool programs for 4-year-olds 
but, rather, reported that center care at age 2 was most predic- 
tive of children’s later cognitive skills (Coley, Lombardi, et al., 
2013). Although this study found benefits of both part-time and 
full-time center care, it did not assess the accumulation of EEC 
experiences over time, leaving open questions concerning whether 
the apparent benefits of center EEC programs for toddlers were 
related to greater overall EEC exposure. 

In contrast to the relatively consistent benefits of center-based 
EEC programs for children’s cognitive skills, research regarding 
children’s behavioral skills paints a less positive picture. For 
example, numerous longitudinal correlational studies find that 
children attending center-based EEC show higher rates of aggres- 
sion, disruptiveness, and other externalizing problem behaviors 
than their peers who use informal home-based EEC or parent care, 
with small associations emerging during early childhood and re- 
maining statistically significant through early elementary school 
and, in some cases, even into adolescence (Belsky et al., 2007; 
Magnuson et al., 2007). The strongest and most consistent asso- 
ciations appear to be with children’s externalizing problems, with 
numerous studies also identifying links with lower attention skills 
and less consistent associations with prosocial behaviors, and with 
associations generally stronger with teacher than parent reports of 
children’s behavioral functioning (Coley, Votruba-Drzal, et al., 
2013; Loeb et al., 2007; Magnuson et al., 2007; NICHD ECCRN, 
2003, 2006). 

In the behavioral arena, previous research has unearthed evi- 
dence that both the duration and intensity of center EEC are 
associated with outcomes. For example, a number of studies have 
assessed children’s accumulated months in center care or age at 
entry, finding that greater duration of center EEC predicted height- 
ened behavior problems (Belsky et al., 2007; Loeb et al., 2007; 
NICHD ECCRN, 2003). Other studies have focused on the inten- 
sity of center exposure through hours per week, again finding a 
dosage effect linked to behavioral outcomes (Belsky et al., 2007; 
Coley, Votruba-Drzal, et al., 2013; Loeb et al., 2007; McCartney 
et al., 2010; NICHD ECCRN, 2003, 2006; Vandell, Belsky, 
Burchinal, Steinberg, & Vandergrift, 2010). Less evidence has 
emerged related to the developmental timing of center care and 
children’s behavioral functioning, with a number of studies finding 
no significant patterns between the timing of EEC and children’s 
later functioning (NICHD ECCRN, 2003; Peisner-Feinberg et al., 
2001) and others finding that early entry into center care exacer- 
bates negative associations with children’s behavioral functioning 
in primary school (Coley, Votruba-Drzal, et al., 2013; Loeb et al., 
2007). 

Again, the majority of this evidence derives from studies of 
American children, with limited information concerning whether 
these patterns generalize into the Australian context. Greater 
amounts of center-based EEC in toddlerhood (2-3 years) have 
been associated with higher concurrent behavior problems among 
Australian children (Yamauchi & Leigh, 2011). Claessens and 
Chen (2013) found that center-based preschool was concurrently 
associated with higher prosocial skills and lower peer problems 
according to mother reports, with no significant effects for dura- 
tion of care. In contrast, children experiencing multiple types of 
care showed lower prosocial skills and higher conduct problems. 
Australian studies have not assessed long-term associations be- 
tween the duration and intensity of EEC and children’s behavioral 
functioning following formal school entry or considered the rela- 
tive role of early versus later EEC experiences. 

An interpretational challenge of this broad range of correlational 
studies is the inability to determine whether there is a true causal 
relationship between EEC and children’s later cognitive and be- 
havioral functioning, or rather whether selection factors may have 
biased the measured associations. Studies using experimental or 
quasiexperimental techniques to assess the influence of state pre- 
kindergarten (pre-K; Gormley & Gayer, 2005; Gormley, Phillips, 
Newmark, Welti, & Adelstein, 2011) or Head Start programs (U.S. 
Department of Health and Human Services, Administration for 
Children and Families, 2010) have largely replicated cognitive 
skills benefits but have not found negative effects on behavioral 
functioning. There are four leading explanations for these different 
patterns. First, the experimental and quasiexperimental techniques 
may better adjust for selection bias, suggesting that the negative 
behavioral effects of preschool care in correlational research might 
be due to differential selection into EEC. Second, given the prev- 
alence of EEC attendance in the United States, the control groups 
utilized in the experimental/quasiexperimental evaluations of 
pre-K and Head Start programs were composed of children in 
center care, Head Start, informal nonparental care, and parent care 
(Gormley et al., 2005, 2011; U.S. Department of Health and 
Human Services, Administration for Children and Families, 2010). 
This lack of a clean experimental manipulation of the treatment 
may have weakened the experimental effects. A third explanation 
is that many of the pre-K and Head Start programs assessed in 
experimental or quasiexperimental studies are held to more rigor- 
ous quality standards than most community-based EEC programs, 
suggesting that higher quality programs may limit negative behav- 
ioral effects on children (McCartney et al., 2010). Finally, these 
pre-K and Head Start programs served primarily lower income 
families. Some have argued that cognitive benefits of center EEC 
may be heightened for poor children, children of less educated 
parents, or children in families who provide less enriching home 
environments (Geoffroy et al., 2010; Loeb et al., 2004, 2007; 
McCartney et al., 2007; Votruba-Drzal et al., 2013), although other 
studies have found stronger detriments for behavioral functioning 
among poor children, as well (Loeb et al., 2007). Together these 
issues highlight the importance of more causal methodological 
techniques and of considering the moderating role of family so- 
cioeconomic status. 


Early Childhood Education Policy in Australia 


Although Australia and the United States share many cultural 
and economic similarities, the context of early childhood education 
and care shows some notable differences. These differences 
largely reflect the Australian government’s broader and more 
generous supports to families with young children, as well as 
differences in family practices and maternal employment. For 
example, while the United States lacks a paid federal parental 
leave policy and has a limited unpaid federal leave policy, Aus- 
tralia has long offered a 1-year unpaid parental leave and a gen- 
erous cash payment upon the birth of a child (recently expanded 
into a federal paid parental leave policy). Policy differences also 
exist for mothers receiving welfare in Australia versus the United 
States. Whereas many U.S. states impose work requirements on poor 
mothers early in their child’s infancy, Australian 
mothers receiving welfare are not required to work until their 
youngest child is 6 years of age (Australian Government Depart- 
ment of Human Services, 2014). Together, these policies encour- 
age more equitable access to resources and parental choice regard- 
ing employment and nonparental care for Australian families 
(Waldfogel, 2009) and may lead to lower use of nonparental care 
for infants. Indeed, recent data suggest that 35% of Australian 
infants are in regular nonparental EEC at 9 months of age, in 
comparison to 50% of American children (Coley, Lombardi, et al., 
2013). 

The Australian government also provides greater financial sup- 
port for families who do use EEC. In the United States direct 
government subsidies for EEC are reserved for low-income and 
poor families, with middle and upper income families receiving 
less generous tax credits. Further, most American families rely on 
the private market for EEC (U.S. General Accounting Office, 
1997). In contrast, the Australian federal government covers up to 
half of families’ center EEC costs (capped at $7,500 yearly) 
through the Child Care Rebate and provides substantial subsidies 
for registered informal EEC including relative care (Australian 
Government Family Assistance Office, 2011; Michel, 2003). In 
addition, public preschools are funded directly by state govern- 
ments in Australia, and programs are more heavily regulated. 
These differences translate into different prevalence rates: al- 
though preschool attendance is lower among economically and 
socially disadvantaged families in both countries (Dowling & 
O’Malley, 2009; Harrison & Ungerer, 2005), a higher proportion 
of 4-year-olds attend preschool in Australia, primarily part-time 
center preschool programs, whereas in the United States preschool 
attendance is lower but more likely to be full-time (Australian 
Bureau of Statistics, 2006; Coley, Lombardi, et al., 2013; Harrison 
& Ungerer, 2005; Harrison et al., 2009). For example, a recent 
comparison of nationally representative samples in the two coun- 
tries found that 75% of Australian 4-year-olds attended center 
preschool programs, 63% part time and 11% full time at the time of the 
interview. In the United States, 69% of 4-year-olds were attending 
center-based preschools, equally split between part-time and full- 
time (Coley, Lombardi, et al., 2013). However, with the rapid 
expansion in publicly funded state pre-K programs in the United 
States that is currently underway, the Australian system provides 
an interesting example of what preschool options may look like in 
coming years. 

In addition to these differences in financial support for EEC, 
public support and quality regulations for EEC differ between the 
countries as well. As in the United States, quality regulations in 
Australia vary across states and territories, although the federal 
government has recently developed a National Quality Framework 
(NQF) regulation system that is replacing separate state licensing 
procedures. This system evaluates the majority of care settings in 
Australia with teacher education and care ratio requirements, 
which differs notably from the United States where roughly 25% 
of children experiencing care attend unregulated care settings 
(Zigler, Marsland, & Lord, 2009). These differing regulations 
result in varying experiences for children across the two countries. 
A recent comparison found that 96% of Australian 2-year-olds 
attending EEC centers were attending accredited programs, com- 
pared to only 32% in the United States; these differences were 
similar for preschoolers, with rates of accreditation among EEC 
centers attended by 4-year-olds at 100% in Australia and only 49% 
in the United States. Similar differences emerged in teacher train- 
ing, with 82% of Australian 2-year-olds but only 23% of American 
2-year-olds in center EEC programs having a head teacher with a 
degree in early childhood education or a related field (Coley, 
Lombardi, et al., 2013). 


In short, policy comparisons suggest that the more generous and 
flexible family policies allow Australian parents more freedom 
than their American counterparts in selecting and affording EEC 
for their young children, which may lead to more stable EEC 
experiences and greater parental satisfaction with EEC choices. 
Greater EEC regulations similarly may lead to higher quality EEC 
in Australia than in the U.S., although this supposition has not been 
directly assessed. These differences in the EEC context lead to the 
hypothesis that the benefits of EEC programs for children’s cog- 
nitive skills may be stronger in Australia, and perhaps that detri- 
ments in terms of children’s behavior problems may be lessened. 


Present Study 


In summary, research findings on EEC have been robust across 
numerous longitudinal data sets of children from the United States; 
however, little research has sought to replicate this research in 
other countries to assess the generalizability and universality of 
findings. With a similar economic and social structure and yet 
some differences in EEC use, funding, and quality controls, Aus- 
tralia offers a context to replicate this research while also exam- 
ining policy differences. The primary goals of this research were to 
delineate the contributions of the duration, intensity, and timing of 
exposure to center-based EEC to children’s cognitive skills and 
behavioral functioning following entry to formal schooling, utiliz- 
ing a nationally representative sample of Australian children fol- 
lowed prospectively from infancy through first grade and rigorous 
statistical methods to help adjust for differential selection into 
EEC. Based upon past research conducted mostly in the United 
States, we expected that greater exposure to center EEC through an 
earlier age of entry or greater hours per week would be associated 
with both higher cognitive skills and higher behavior prob- 
lems for children. In terms of the timing of EEC, we expected that 
center care after age 2 would be more positively linked to cogni- 
tive skills than infant center care, whereas for behavioral function- 
ing, we expected that a greater duration or intensity of exposure to 
center care would be associated with greater behavior problems 
and lower attention skills, with less consistent associations with 
prosocial behaviors. Given the greater flexibility and perhaps 
higher quality of care in Australia, we expected that cognitive 
effects would be somewhat stronger and behavioral effects some- 
what weaker than results that U.S. studies have unearthed. 


Method 


Sample 


Data for this article were drawn from the Longitudinal Study of 
Australian Children Birth Cohort (LSAC), a multimethod study 
seeking to document children’s development and proximal envi- 
ronments from infancy through childhood. The LSAC sampled a 
nationally representative cohort of 5,107 children born in Australia 
between March 2003 and February 2004. Births were sampled 
from the Medicare enrollment database, in which all Australian 
children are enrolled. Stratification was used to ensure propor- 
tional geographic representation for each state and territory. The 
survey sample excluded nonpermanent residents and children with 
the same name as deceased children, and only allowed for one 
child per household. A study comparing the LSAC population with 
the 2001 Census population found that the LSAC largely mirrored 
the general population with few differences across a wide range of 
demographic measures (Soloff, Lawrence, Misson, & Johnstone, 
2006). For more information on the LSAC, see Soloff, Lawrence, 
and Johnstone (2005). 

The LSAC has collected seven waves of data with in-person 
interviews and direct assessments as well as mail-in surveys. 
In-person interviews with parents occurred when children were 9 
months (Wave 1), 3 years (Wave 2), 5 years (Wave 3), and 7 years 
(Wave 4) with response rates of 58%,¹ 90%, 86%, and 84%, 
respectively. Mail-in written parent surveys were collected every 
in-between year when children were 2 years (Wave 1.5), 4 years 
(Wave 2.5), and 6 years (Wave 3.5) with response rates of 70%, 
64%, and 59%, respectively. At each wave there was some vari- 
ability in the age at which children were assessed, with standard 
deviations averaging about 3 months at each wave (see Table 1) 
and approximately 90% of children falling within ±4 months of 
the target age at each assessment. Age at assessment was included 
as a covariate in all models to help adjust for these differences. 

The analytic sample included all children in the Wave 1 sample, 
N = 5,107. Within the analytic sample there were missing obser- 
vations due to attrition over the waves and missing data on indi- 
vidual measures. Because missing data introduce biases into the 
sample, missing data were imputed using multiple imputation by 
chained equations, implemented in Stata 12 (Royston, 2004, 2005) 
to create 10 complete data sets. After imputation, survey weights, 
which adjust for selection criteria and differential response, were 
incorporated in all analyses. The use of these weights makes the 
sample representative of infants born in Australia between March 
2003 and February 2004. 
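
As an illustration only, the sketch below shows one common way that a coefficient estimated separately in each of the 10 imputed data sets can be combined into a single estimate and standard error. It applies Rubin's rules; the text above does not name the pooling procedure, so the choice of Rubin's rules, the function name pool_rubin, and the example numbers are assumptions for illustration and are not LSAC results.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    estimates  -- the coefficient estimated in each imputed data set
    std_errors -- the matching standard errors
    (Both inputs are hypothetical; they stand in for per-imputation model output.)
    """
    m = len(estimates)
    pooled = mean(estimates)                     # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average within-imputation variance
    between = variance(estimates)                # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / m) * between       # total variance of the pooled estimate
    return pooled, sqrt(total)


# Made-up values for three imputations, purely to show the call.
pooled_b, pooled_se = pool_rubin([0.10, 0.12, 0.08], [0.05, 0.05, 0.05])
print(round(pooled_b, 3), round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data is carried into the final inference.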

The LSAC offers a number of strengths for the purposes of this 
research. The data are nationally representative and hence present 
a generalizable sample of young children. The sample is large and 
includes sizable subsamples of economically disadvantaged chil- 
dren and Aboriginal children. Moreover, the sample contains very 
strong measurements of children’s development using reliable and 
well-validated instruments. Data on child functioning were col- 
lected from three different sources at age 7: direct assessments, 
parent reports, and teacher reports. The use of multiple sources of 
information on children’s functioning is important on many fronts. 
First, it helps to assuage analytic concerns over shared method 
variance. Second, parent and teacher reports of children’s behav- 
ioral functioning and direct assessments and teacher reports of 
children’s cognitive skills are only moderately correlated (due both 
to differences in children’s functioning across contexts and to 
differences in raters’ expectations and impressions of children; 
Cabell, Justice, Zucker, & Kilday, 2009; Kilday, Kinzie, Mash- 
burn, & Whittaker, 2012; Strickland, Hopkins, & Keenan, 2012), 
yet all show predictive validity to longer-term functioning sug- 
gesting that they provide unique windows into children’s well- 
being. 

Another strength of the LSAC is that it provides five points of 
data from infancy to preschool, providing rich information on 
children’s EEC experiences through early childhood, and followed 
children through school entry. At the same time, it is important to 
acknowledge limitations, namely, that data on EEC were only 
collected at discrete time points and do not include a full account 
of EEC experiences from birth through preschool and that the 
developmental quality of care, another important characteristic of 
early care experiences, was not assessed in the LSAC and hence 
could not be taken into account. 


Measures 


EEC characteristics. At Waves 1 through 3 (Waves 1, 1.5, 2, 
2.5, and 3) parents reported on children’s regular nonparental care 
settings. At each wave, EEC type was coded into three mutually 
exclusive categories designating center-based (day care center, 
preschool, or other center-based child care program), informal care 
(grandparent, other relative, nanny, other nonrelative, family day 
care, occasional care, gym/leisure/community center, mobile care 
unit), or parent care. Children in regular nonparental care settings 
for less than 5 hr per week and children who did not experience 
any nonparental care were coded as being in parent care. Children 
in center care, including those attending only center care and those 
attending both center care and informal care, were coded as being 
in center care.² Children only in informal care were coded as 
informal care. Mothers also reported the total number of hours per 
week that children spent in EEC at Waves 1, 1.5, 2, 2.5, and 3 of 
the survey, coded in units of 10. At Wave 3, about 20% of children 
had entered kindergarten; for these children only four waves of 
data on EEC were available. For the 80% of children who had not 
entered kindergarten by Wave 3, five waves of data on EEC were 
assessed. 
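
The coding rules above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function name code_eec_type and its two arguments are hypothetical, and applying the 5-hr threshold to total weekly nonparental hours (center plus informal) is an assumption, since the text does not say how hours were combined across settings.

```python
def code_eec_type(center_hours, informal_hours):
    """Return the mutually exclusive EEC type for one child at one wave.

    center_hours / informal_hours -- weekly hours in each kind of regular
    nonparental care (hypothetical inputs; 0 means the setting is not used).
    """
    total_hours = center_hours + informal_hours
    # Assumption: fewer than 5 total hours per week counts as parent care.
    if total_hours < 5:
        return "parent"
    # Center care takes priority, even when informal care is also used.
    if center_hours > 0:
        return "center"
    return "informal"


def hours_in_units_of_ten(weekly_hours):
    """Express weekly hours in units of 10, as the hours variable is coded."""
    return weekly_hours / 10


print(code_eec_type(0, 0))    # parent
print(code_eec_type(10, 20))  # center
print(code_eec_type(0, 12))   # informal
```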

These EEC measures were then used to create three sets of 
variables to delineate children’s duration, intensity, and timing of 
exposure to center-based EEC. The duration of center EEC was 
measured through a variable indicating the percentage of waves in 
center care. Another variable delineated the percentage of waves in 
informal care. The intensity of center EEC was measured with the 
average number of hours children spent in center EEC (hours were 
coded as 0 for waves in which children were not in any center 
care). Finally, to assess the timing of center EEC, we created six 
mutually exclusive categories: (a) parent or informal care only 
from infancy until primary school, (b) center care from infancy 
(Wave 1) through preschool (Waves 2.5 and/or 3), (c) center care 
from toddlerhood (Waves 1.5 and/or 2) through preschool, (d) 
center care only during preschool (Waves 2.5 and/or 3), (e) center 
care in infancy and/or toddlerhood but not in preschool, and (f) 
inconsistent center care that included some center care in infancy 
and/or toddlerhood plus preschool. Initial analyses of these groups 
determined that there were no differences in functioning between 
the children who entered center care in infancy and stayed (Group 
2), entered center care in toddlerhood and stayed (Group 3), and 
children who entered center care early and attended preschool but 
spent at least one wave in parent or informal care (Group 6). Thus 
we collapsed these three categories into one, resulting in four 
mutually exclusive categories indicating the timing of center care: 
(a) no center care from infancy through preschool, (b) center care 
in infancy and/or toddlerhood and preschool, (c) center care only 


¹ Different response rates have been reported based on different calcu- 
lations. This response rate includes nonresponse from all sources from the 
originally drawn sample (see Gray & Sanson, 2005). 

² Center care was prioritized in this manner due to extant literature 
suggesting the significant role of center care in children’s development 
(Loeb, Bridges, Bassok, Fuller, & Rumberger, 2007; Magnuson, Meyers, 
Ruhm, & Waldfogel, 2004; Morrissey, 2010). 
Table 1 
Sample Descriptives 

Variable M SD % M SD % M SD % M SD % M SD % M SD % 

Care type 
Parent 65.46 47.39 33.17 14.53 6.98 
Center 10.69 26.16 41.61 74.69 90.64 
Informal 23.85 26.49 25.22 10.78 2.38 
Care duration 
% of waves center care 0.46 0.26 
% of waves informal care 0.16 0.21 
Care intensity 
Hours of care/10 0.23 0.78 0.51 1.06 0.84 1.28 1.34 1.22 1.59 1.46 
Average hours of care/10 0.94 0.81 
Timing of care 
Parent only care 7.63 
Early center plus preschool 44.87 
Preschool only 45.27 
Early center, no preschool 2.23 
Child outcomes 
Academic skills 3325095 
Matrix reasoning 10.60 3.02 
Vocabulary 74.00 5.15 
Parent attention skills 0.89 0.45 
Parent conduct problems 0.32 0.29 
Parent prosocial skills 1.66 0.34 
Teacher attention skills 0.98 0.51 
Teacher conduct problems 0.20 0.28 
Teacher prosocial skills 1.50 0.44 
Covariates 
Multiple care arrangements 8.41 17.96 18.54 36.65 39.83 
Child age 8.86 2.57 21.49 3.38 34.04 2.92 47.28 3.34 57.74 2.86 81.98 3.15 
Child male 51.16 
Child low birth weight 6.02 
Child bad health B12 
Child temperament 4.45 0.62 
Child cognitive skills 25.88 9.70 
Parent Asian 8.51 
Parent Aboriginal 4.58 
Immigrant household 31.50 
Non-English household 15.67 
Child number siblings 0.99 1.07 1.28 1.03 eS iemleOL 
Parent married 73.34 71.42 73.09 
Youngest parent’s age 30.41 5.29 
Parent < high school education 6.90 6.01 5.37 
Parent high school education 6.50 5.67 4.69 
Parent some college 48.01 49.27 49.31 
Parent college/grad school 38.59 39.05 40.63 
Household income/10,000 6.89 4.16 7.83 4.86 9.62 5.53 
Low income 19.94 
Mother employed 32.51 49.83 58.69 
Cognitive stimulation 1.92 0.56 1.65 0.57 


during preschool, and (d) center care in infancy and/or toddlerhood 
but not in preschool. 

In addition to the main EEC variables of interest, indicator 
variables were created across waves denoting whether children 
were in multiple EEC arrangements, given recent research sug- 
gesting that multiple care arrangements may be linked to worse 
functioning in children (Claessens & Chen, 2013; Morrissey, 
2009). These variables assessed whether children experienced 
multiple care arrangements at all waves (Waves 1, 1.5, 2, 2.5, and 
3), some waves, or did not experience multiple care experiences at 
any wave (reference). 
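
To make the duration, intensity, and timing variables described above concrete, the following Python sketch derives all three from a child's wave-by-wave records. It is a simplified reading of the text rather than the study's code: the function name summarize_center_care, the dictionary inputs, and the grouping of Waves 1, 1.5, and 2 as infancy/toddlerhood and Waves 2.5 and 3 as preschool are assumptions based on the category definitions given earlier. Waves missing from the input (for example, Wave 3 for children already in kindergarten) are simply left out of the averages.

```python
WAVES = ["1", "1.5", "2", "2.5", "3"]  # the five EEC waves named in the text
EARLY = {"1", "1.5", "2"}              # infancy and toddlerhood (assumed grouping)
PRESCHOOL = {"2.5", "3"}               # preschool (assumed grouping)


def summarize_center_care(care_type, center_hours):
    """Return (duration, intensity, timing) for one child.

    care_type    -- dict mapping wave -> "center", "informal", or "parent"
    center_hours -- dict mapping wave -> weekly hours in center care
    """
    observed = [w for w in WAVES if w in care_type]
    in_center = [w for w in observed if care_type[w] == "center"]

    # Duration: share of observed waves spent in center care.
    duration = len(in_center) / len(observed)

    # Intensity: average weekly center hours in units of 10, counting
    # waves without center care as 0 hours.
    total_hours = sum(center_hours.get(w, 0) for w in in_center)
    intensity = total_hours / len(observed) / 10

    # Timing: the four collapsed, mutually exclusive categories.
    early = any(w in EARLY for w in in_center)
    preschool = any(w in PRESCHOOL for w in in_center)
    if early and preschool:
        timing = "early center plus preschool"
    elif preschool:
        timing = "preschool only"
    elif early:
        timing = "early center, no preschool"
    else:
        timing = "no center care"
    return duration, intensity, timing


child = {"1": "parent", "1.5": "informal", "2": "center", "2.5": "center", "3": "center"}
hours = {"2": 20, "2.5": 30, "3": 30}
print(summarize_center_care(child, hours))
```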

Children’s outcomes. Measures of child functioning were 
drawn from Wave 4, when children averaged 7 years of age and 
were typically in Year 1 (first grade) of primary school. Wave 4 
was chosen because child functioning measures were not assessed 
at Wave 3.5, and at Wave 3 about 80% of children had not yet 
entered primary school and were still attending EEC programs. 
Three measures of children’s cognitive skills were assessed at age 
7 using direct assessments and teacher reports: vocabulary, aca- 
demic skills, and matrix reasoning. Children’s receptive vocabu- 
lary skills were directly assessed by field interviewers using a 
shortened version of the Peabody Picture Vocabulary Test (3rd ed.; 
PPVT-III; Dunn & Dunn, 1997). The PPVT was scored using item 
response theory (IRT) and then transformed to generate a scale 
with a mean of 64 and a standard deviation of 8. Children’s 
academic skills were assessed with teacher reports using the Lan- 
guage and Literacy and Mathematical Thinking subscales from the 
Academic Rating Scale (ARS; National Center for Educational 
Statistics, 2002). The Language and Literacy Scale (α = .96) had 
nine items (e.g., conveys ideas when speaking, reads fluently) that 
rate a child’s performance in oral and written language according 
to a 5-point scale (not yet = 1, beginning = 2, in progress = 3, 
intermediate = 4, and proficient = 5). The Mathematical Thinking 
Scale (α = .94) used the same scale to rate a child’s performance 
on nine spatial and math items (e.g., creates and extends patterns, 
recognizes shape properties and relationships). Due to the high 
correlation between the two scores (r = .81), the measures were 
averaged to create one composite assessing teacher-reported lan- 
guage, literacy and mathematical thinking, termed “academic 
skills.” The final measure of children’s cognitive achievement was 
the Matrix Reasoning (MR) test from the Wechsler 
Intelligence Scale for Children (4th ed.; WISC-IV). This test of 
nonverbal and fluid intelligence (35 items) presents the child with 
an incomplete set of diagrams and requires the child to select the 
picture that completes the set from five different options. 
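
Two of the scoring steps above are simple arithmetic and can be sketched in Python. The example is illustrative only: the text says the IRT-based PPVT scores were transformed to a mean of 64 and a standard deviation of 8 but does not give the formula, so the linear rescaling shown here is an assumption, as are the function names rescale and academic_composite and the sample values.

```python
from statistics import mean, pstdev


def rescale(scores, target_mean=64.0, target_sd=8.0):
    """Linearly rescale scores to a chosen mean and standard deviation.

    Assumption: a simple linear transformation of the IRT scores; the
    source does not state which transformation was used.
    """
    m = mean(scores)
    s = pstdev(scores)
    return [target_mean + target_sd * (x - m) / s for x in scores]


def academic_composite(language_literacy, math_thinking):
    """Average the two teacher-rated subscale scores into one composite."""
    return (language_literacy + math_thinking) / 2


irt_scores = [-1.0, 0.0, 1.0]      # made-up IRT ability estimates
print(rescale(irt_scores))
print(academic_composite(3.0, 4.0))
```
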
Children’s behavioral functioning was reported separately by 
parents and by teachers using items from the Strengths and Diffi- 
culties Questionnaire (SDQ; Goodman, 1997), which rates chil- 
dren’s skills on a 3-point scale (not true = 0, somewhat true = 1, 
and certainly true = 2). Factor analyses run separately by reporter 
derived three subscales assessing attention skills, prosocial behav- 
iors, and conduct problems. For each subscale, items were aver- 
aged into a total score, leading to three scores from parent reports 
and three scores from teacher reports. The attention skills sub- 
scales (parent α = .78; teacher α = .88) each included five items assessing 
children’s ability to sit still, fidgeting, distractibility, thinking 
before acting, and attention span. Higher scores indicate greater 
attention and learning behaviors. The prosocial behaviors sub- 
scales (parent α = .70; teacher α = .83) were composed of five items assessing 
children’s considerate, sharing, helpful, kind, and volunteering 
behaviors. Also composed of five items, the conduct problems 
subscales (parent α = .60; teacher α = .76) covered children’s temper tantrums, 
obedience, fighting, lying or cheating, and stealing behaviors. 
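
The subscale scoring and the internal consistency coefficients reported above can be illustrated with a short Python sketch. It assumes that each subscale score is the plain mean of its five 0 to 2 ratings and that all items are already coded in the same direction (the text does not discuss reverse scoring); the function names and the example ratings are hypothetical.

```python
from statistics import mean, pvariance


def subscale_score(item_ratings):
    """Average a child's item ratings (0, 1, or 2) into one subscale score."""
    return mean(item_ratings)


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-child rating lists (one rating per item)."""
    k = len(responses[0])                                    # number of items
    item_vars = [pvariance([r[i] for r in responses]) for i in range(k)]
    total_var = pvariance([sum(r) for r in responses])       # variance of summed scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up ratings for four children on a five-item subscale.
ratings = [
    [2, 2, 1, 2, 2],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [2, 1, 2, 2, 1],
]
print([subscale_score(r) for r in ratings])
print(round(cronbach_alpha(ratings), 2))
```
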
Child characteristics. A broad range of child and family 
characteristics were assessed through parent report. Child charac- 
teristics included age of assessment (in months) and gender. Child 
low birthweight status was represented with an indicator of 
whether the child was born at low birthweight (less than 2,500 
grams). Child health condition was also represented by an indicator 
that reflected whether the child was in fair or poor health based 
on parent report at Wave 1. Cognitive ability was assessed at 
Wave 1 using the Communication and Symbolic Behavior Scales 
Developmental Profile: Infant-Toddler Checklist (CSBS DP; 
Wetherby & Prizant, 2001). The CSBS (24 items, α = .89) yields 
a standardized normed score of children’s early social, language 


and cognitive skills. Child temperament was measured with a 
shortened version of the Australian revision of the Toddler Tem- 
perament Scale (TTS; Fullard, McDevitt, & Carey, 1984), with 
four items rated on 6-point scales assessing children’s abilities in 
each of three domains: approach, persistence, and reactivity (α = 
0.98-0.99). These three domains were combined into a composite 
measure with higher scores indicating an easier temperament with 
more approachability, persistence, and regulation. 

Parental and household characteristics. Several parental 
and household characteristics were also assessed. Time-varying 
characteristics were measured at Waves 1, 2, and 3 and were coded 
to tap into shifts over time. Parent race/ethnicity was indicated 
with two dummy variables indicating (a) Asian origin and (b) 
Aboriginal origin of either parent. A dichotomous variable indi- 
cated whether either parent was an immigrant to Australia. An 
additional dichotomous variable indicated whether the primary 
language of the household was non-English. Family structure 
covariates included maternal marital status, measured categori- 
cally, delineating whether the respondent was consistently married 
(vs. single or cohabiting) at all waves, married at some waves, or 
not married at any wave (reference), and the number of children 
under age 18 in the household, measured at Wave 1 and then 
measured as changes in the number of children at Waves 2 and 3. 
Parental age was measured with a continuous measure of the age 
in years of the youngest parent in the household at Wave 1. 
Parental education was assessed using the highest level of educa- 
tional attainment that parents reported at Waves 1 through 3 of 
data collection (shifts over time were not assessed due to limited 
change). Categorical indicators designated parents who had earned 
less than a high school degree, a high school degree but no college 
(reference), above a high school degree but less than a college 
degree, and a Bachelor’s degree or higher. Total household annual 
income was expressed in units of 10,000 at Wave 1 and then 
measured as changes in income at Waves 2 and 3. Maternal 
employment was measured categorically across waves, delineating 
whether mothers were consistently employed across all waves, 
employed at some waves, or not employed at any wave (refer- 
ence). At Waves 2 and 3 of the survey, parents’ provision of 
enriching home environments was assessed. Parents reported the 
weekly frequency of a variety of activities such as drawing pic- 
tures with, reading to, and playing outdoors with their child (seven 
items). Responses ranged from 0 (none) to 3 (6-7 days per week) 
and were averaged within each wave, creating a measure of cog- 
nitive stimulation at Wave 2 and changes in stimulation at Wave 3 
(α = 0.70-0.71). 


Analytic Approach 


The primary goal of the analyses was to assess how exposure to 
center EEC from infancy through preschool was associated with 
Australian children’s cognitive and behavioral skills following 
school entry. This question was assessed using a series of ordinary 
least squares (OLS) regression models predicting children’s func- 
tioning at Wave 4 from their duration, intensity, and timing of 
center EEC exposure. Given that prior research has identified 
notable differences between characteristics of children in parental 
or informal care and children in center EEC programs (Coley, 
Votruba-Drzal, Collins, & Miller, 2014; Meyers & Jordan, 2006), 
a primary concern for the present study is that selection processes 
rather than child care experiences per se may explain any associations 
with children’s cognitive and behavioral functioning. To 
address this significant concern, three techniques were incorpo- 
rated in the analyses. First, all OLS regression models incorporated 
a large set of child, maternal, and household characteristics drawn 
from Waves 1 through 3 as covariates. Just as the EEC variables 
were aggregated over Waves 1 through 3 to capture children’s 
exposure to EEC throughout early childhood, the covariates sim- 
ilarly were coded to assess children’s changing contexts over 
Waves 1 through 3. Covariates were selected (listed in Table 1) 
because they have been shown to be associated with selection into 
child care in prior research (Coley et al., 2014; Meyers & Jordan, 
2006), although even the most thorough set of covariates leaves 
open the potential for omitted variable bias (Duncan, Magnuson, & 
Ludwig, 2004). As a second mechanism helping to control for 
unmeasured variable bias, models were run as lagged regressions, 
incorporating a Wave 1 measure of cognitive ability (for models 
predicting cognitive skills) or a Wave 1 measure of child temper- 
ament (for models predicting behavioral outcomes) as an addi- 
tional covariate to control for unmeasured, time-invariant factors 
that have a consistent effect on children’s functioning (Cain, 
1975), thus further reducing concerns of omitted variable bias. 

Third, propensity score weighting (PSW) techniques were used 
to help further adjust for potential selection bias (Imbens, 2000; 
Rosenbaum & Rubin, 1983). Propensity score (PS) techniques 
help to equate respondents on observed, preexisting characteristics 
(Rosenbaum & Rubin, 1983). Propensity score techniques have 
been shown to remove as much as 90% of selection bias in 
nonexperimental research (Leon & Hedeker, 2007), although it is 
important to note that PS techniques cannot control for unobserved 
factors, the influence of which may even be magnified by match- 
ing on observables (Pearl, 2009). We incorporated propensity 
scores using the three-step weighting procedure described by 
Imbens (2000). This propensity score weighting procedure is 
highly flexible as it is able to accommodate both continuous and 
categorical measures such as our primary measures of center EEC 
(Imbens, 2000). The first step involved estimating each child’s 
propensity to receive the “treatment,” that is to be in center EEC. 
For the continuous measures of duration (% waves in center EEC) 
and intensity (average hours of center EEC), the first step used 
OLS regression models to estimate the propensity of having a 
higher % of waves or more hours in center care as a function of 
observed pretreatment covariates (all child and family character- 
istics noted above drawn from Wave 1). For the categorical timing 
of center EEC variable, multinomial logistic regression models 
were used to estimate the propensity of being in each of the four 
patterns of EEC as a function of all observed pretreatment (Wave 
1) covariates. Appendix Table A1 presents results of these models 
using one randomly selected imputed data set as an exemplar. In 
the second step, propensity score weights were created by taking 
the inverse of each child’s conditional probability of receiving the 
EEC treatment that the child actually received (Imbens, 2000). In 
the third step, we incorporated the propensity score weights in the 
lagged longitudinal regression models predicting child functioning 
at age 7. These models were run weighted with the EEC treatment- 
specific propensity score weights multiplied by the sample weights 
to generate the average treatment effect of the EEC experience, as 
shown in Equation 1. 


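$$
\text{Child Outcome}_{4i} = B_0 + B_1\,\text{EEC}_{(1\text{-}3)i} + B_2\,\text{Child Outcome}_{1i} + B_3\,\text{Maternal}_{(1\text{-}3)i} + B_4\,\text{Child}_{(1\text{-}3)i} + \varepsilon_i \tag{1}
$$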
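
The second and third steps of the weighting procedure can be illustrated with a short sketch. It is deliberately simplified and is not the authors' implementation: the propensities are taken as given rather than estimated from Wave 1 covariates, the outcome model has a single binary predictor rather than the full covariate set of Equation 1, and all names and values are hypothetical.

```python
def ipw_weights(probs, received, sample_weights):
    """Inverse-probability weight times survey weight for each child.

    probs[i] maps each care pattern to child i's estimated probability of
    that pattern; received[i] is the pattern child i actually experienced.
    """
    return [sw / p[r] for p, r, sw in zip(probs, received, sample_weights)]

def weighted_slope(x, y, w):
    """Slope from a weighted least squares regression of y on one predictor x."""
    total = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / total
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Hypothetical step 1 output: each child's estimated probability of each pattern.
probs = [
    {"center": 0.8, "none": 0.2},
    {"center": 0.8, "none": 0.2},
    {"center": 0.5, "none": 0.5},
    {"center": 0.5, "none": 0.5},
]
received = ["center", "none", "center", "none"]
sample_weights = [1.0, 1.0, 2.0, 2.0]

# Step 2: invert the probability of the treatment actually received,
# then multiply by the survey sample weight.
weights = ipw_weights(probs, received, sample_weights)

# Step 3: weighted outcome regression (here, one binary treatment indicator).
x = [1 if r == "center" else 0 for r in received]
y = [10.0, 8.0, 12.0, 11.0]
effect = weighted_slope(x, y, weights)
```

Children who received a treatment that was unlikely given their characteristics (the second child here) receive the largest weights, which is how weighting rebalances the groups on observed covariates.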


After the first set of models assessing associations between 
children’s duration, intensity, and timing of exposure to center 
EEC and their cognitive and behavioral skills, a second set of 
analyses were estimated to address whether associations between 
EEC experiences and children’s later functioning differed depend- 
ing on family income, parental education, home environment, or 
child temperament. Income and home environment moderation 
were tested with interactions between centered, continuous mea- 
sures of income (averaged over Waves 1-3) and home environ- 
ments (averaged over Waves 2 and 3) and each of the EEC 
variables. Education moderation was tested with interactions be- 
tween categorical indicators of parental education (less than high 
school, some college, and a college or graduate degree) and each 
of the EEC variables, with posthoc analyses to address differences 
between groups. Moderation by child temperament was assessed 
with interactions between children’s centered Wave 1 tempera- 
ment measure and each of the EEC variables. 
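
A moderation test of this kind can be sketched as below, assuming a single EEC variable (weekly center hours) and a single moderator (income). The data are hypothetical and noise-free, constructed with a known interaction so the recovered coefficient can be checked; the small solver stands in for whatever statistical package was actually used, and the sketch omits covariates and weights.

```python
def center(values):
    """Subtract the mean so that 0 represents the sample average."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def ols(X, y):
    """Least squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Hypothetical data: weekly center hours and household income (units of 10,000).
hours = [0, 5, 10, 15, 20, 25, 30, 35]
income = [3, 8, 4, 9, 5, 10, 6, 7]
h_c, inc_c = center(hours), center(income)

# Outcome constructed (noise-free) with a known interaction of 0.05.
y = [100 + 0.2 * h + 1.5 * m + 0.05 * h * m for h, m in zip(h_c, inc_c)]

# Design matrix: intercept, both centered main effects, and their product.
X = [[1.0, h, m, h * m] for h, m in zip(h_c, inc_c)]
b0, b_hours, b_income, b_interaction = ols(X, y)
```

Because both variables are centered, `b_hours` is the association of hours with the outcome at average income, and `b_interaction` is how much that association changes per unit of income.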


Results 


Sample Descriptives 


Descriptive statistics for the EEC measures, child and family 
covariates, and children’s age 7 functioning are displayed in Table 
1. Participation in EEC grew from 35% during infancy to 93% 
when children were nearly 4 years old (among children not yet in 
kindergarten), with notable growth in center care and decline in 
parent and informal care. Overall, children were in center-based 
programs nearly half of the time from infancy through age 4, 
spending an average of 46% of waves in center care and 16% of 
waves in informal care. The intensity (hours per week) of center 
EEC also increased, although by the preschool-age wave children 
averaged only 16 hr per week in center care. Over all time periods, 
children experienced an average of less than 10 hr per week in 
center care (8% had 0 hr, 59% less than 10, 24% averaged 10 to 20 
hr, and only 10% of children averaged 20 hr or more per week of 
center care). 

Although the patterns of increasing use and intensity of center 
EEC suggest a linear increase, an examination of individual 
children’s experiences found several distinct patterns of center 
EEC exposure for Australian children. Approximately 45% of 
children were in center care at all waves or nearly all waves, 
including both during infancy or toddlerhood and during pre- 
school. Another 45% of children only used center care during 
preschool. A very small percentage (2%) attended center care at 
some point during infancy or toddlerhood but did not attend 
preschool. Another small group (8%) did not attend center care at 
all prior to entering kindergarten. 
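
The four timing patterns can be derived from wave-level indicators along the following lines. This is a sketch under stated assumptions: we treat Waves 1, 2, and 3 as the infant, toddler, and preschool periods, the function name is ours, and the six children are invented.

```python
from collections import Counter

def care_pattern(infant, toddler, preschool):
    """Map three wave-level center-care indicators to one timing category."""
    early = infant or toddler
    if early and preschool:
        return "early center plus preschool"
    if preschool:
        return "preschool only"
    if early:
        return "early center, no preschool"
    return "no center care"

# Hypothetical children: (in center care at Wave 1, Wave 2, Wave 3).
children = [
    (True, True, True),
    (False, False, True),
    (False, True, True),
    (True, False, False),
    (False, False, False),
    (False, False, True),
]
counts = Counter(care_pattern(*c) for c in children)
shares = {pattern: n / len(children) for pattern, n in counts.items()}
```

Collapsing infancy and toddlerhood into a single "early" period mirrors the four categories reported in the text, at the cost of not distinguishing children who started in infancy from those who started as toddlers.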


EEC Predicting Child Cognitive Skills at Age 7 


The first set of models shown in the top panel of Table 2 (Model 
1) examined whether there were differences in children’s cognitive 
skills at age 7 depending on the duration of center and informal 


COLEY, LOMBARDI, AND SIMS 

Table 2 
Propensity Score Weighted OLS Regression Models With Duration, Intensity, and Patterns of 
EEC Predicting Child Cognitive Skills at Age 7 

Independent variable                    Teacher academic  Matrix            Vocabulary 
                                        skills            reasoning 

Model 1: Duration of care 
% waves center care                     0.13 (0.07)†      0.65 (0.23)**     0.12 (0.41) 
% waves informal care                   0.05 (0.08)       0.10 (0.28)       [illegible] 
Covariates 
Multiple care all waves                 0.01 (0.10)       −0.38 (0.44)      0.73 (0.72) 
Multiple care some waves                −0.01 (0.04)      −0.01 (0.14)      0.13 (0.26) 
Child age                               0.03 (0.01)**     −0.04 (0.02)*     0.24 (0.03)** 
Child male                              −0.12 (0.03)**    −0.09 (0.09)      [illegible] 
Child low birth weight                  −0.21 (0.06)**    −0.39 (0.24)      −0.16 (0.42) 
Child bad health                        0.00 (0.07)       0.10 (0.30)       −0.88 (0.46)* 
Child cognitive skills                  0.01 (0.00)*      0.01 (0.01)       0.02 (0.01)* 
Parent Asian                            0.09 (0.07)       0.38 (0.29)       −0.50 (0.44) 
Parent Aboriginal                       −0.15 (0.08)*     −0.02 (0.3)       −0.94 (0.53)† 
Immigrant household                     0.00 (0.03)       0.24 (0.13)†      0.02 (0.22) 
Non-English household                   −0.02 (0.06)      −0.18 (0.21)      −1.80 (0.33)** 
Wave 1 siblings                         −0.06 (0.02)**    −0.20 (0.06)**    −0.69 (0.11)** 
Wave 2 change in siblings               0.05 (0.03)*      −0.01 (0.11)      −0.16 (0.18) 
Wave 3 change in siblings               0.02 (0.03)       0.07 (0.11)       0.18 (0.19) 
Parent married all waves                0.16 (0.04)**     0.21 (0.16)       0.35 (0.25) 
Parent married some waves               0.09 (0.05)*      0.05 (0.20)       0.18 (0.30) 
Youngest parent’s age                   0.01 (0.00)*      0.03 (0.01)**     0.14 (0.02)** 
Parent < high school education          −0.19 (0.09)†     −0.52 (0.40)      −0.53 (0.64) 
Parent some college                     −0.02 (0.06)      0.06 (0.24)       0.63 (0.40) 
Parent college/grad school              0.13 (0.06)*      0.65 (0.25)*      1.66 (0.41)** 
Wave 1 household income                 0.14 (0.03)**     0.49 (0.14)**     0.87 (0.24)** 
Wave 2 change in income                 0.06 (0.04)       0.63 (0.17)**     0.47 (0.28) 
Wave 3 change in income                 0.08 (0.04)*      0.09 (0.18)       −0.06 (0.25) 
Mother employed all waves               0.04 (0.04)       0.08 (0.16)       0.16 (0.29) 
Mother employed some waves              0.04 (0.03)       −0.03 (0.13)      0.26 (0.22) 
Wave 2 cognitive stimulation            0.05 (0.03)       0.20 (0.12)       1.12 (0.18)** 
Wave 3 change in cognitive stimulation  −0.01 (0.03)      −0.05 (0.11)      0.31 (0.20) 
F of model                              15.28             7.74              21.05 
R²                                      0.15              0.07              0.19 

Model 2: Intensity of care 
Avg. hours in center care               0.02 (0.07)       0.61 (0.29)*      0.39 (0.48) 
F of model                              13.09             7.63              18.01 
R²                                      0.14              0.07              0.19 

Model 3: Patterns of care 
Early center plus preschool             0.02 (0.07)       0.61 (0.29)*ᵃ     0.39 (0.48) 
Preschool only                          −0.03 (0.07)      0.29 (0.27)ᵃ      0.09 (0.44) 
Early center, no preschool              −0.13 (0.15)      0.90 (0.49)       −0.13 (0.79) 
F of model                              4.44              [illegible]       5.56 
R²                                      0.16              0.13              0.19 

Note. OLS = ordinary least squares; EEC = early education and care; Avg. = average. Models 2 and 3 
included all covariates. Within columns, matched superscripts indicate difference at p < .05. 
† p < .10. * p < .05. ** p < .01. 


EEC that children experienced from infancy until kindergarten 
entry. Models were weighted with PSW weights that adjusted for 
each child’s propensity of being in higher percentages of center 
care, based upon observed characteristics from Wave 1. Due to 
limitations of PSW, we could not adjust for both the percentage of 
center EEC and percentage of informal EEC within the same 
model; however, we ran models adjusting for each and found the 
results to be nearly identical between sets of models, so we present 
only the models adjusting for the percentage of center EEC. 
Results from Model 1 indicate that greater exposure to center care 


from infancy through preschool was associated with enhanced 
cognitive skills at age 7. The effect sizes were very small. A 
one-standard-deviation (SD) difference in waves of center care 
was predictive of .06 SDs higher matrix reasoning scores with 
nonsignificant results for academic and vocabulary skills. The 
percentage of waves in informal care was not significantly asso- 
ciated with any measure of children’s cognitive skills, supporting 
the primacy of center EEC programs in children’s functioning. 
Similarly, exposure to concurrent multiple care arrangements was 
not predictive of children’s cognitive skills in first grade. 
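
Effect sizes of this kind are typically obtained by rescaling an unstandardized coefficient by the standard deviations of the predictor and the outcome. The sketch below shows the arithmetic only. The coefficient 0.65 is the matrix reasoning estimate for % waves in center care; the two SDs are hypothetical placeholders chosen for illustration, because the descriptive statistics are not reproduced in this section.

```python
def standardized_effect(b, sd_predictor, sd_outcome):
    """SD change in the outcome per one-SD change in the predictor."""
    return b * sd_predictor / sd_outcome

b_center = 0.65      # unstandardized coefficient (matrix reasoning, Model 1)
sd_waves = 0.28      # hypothetical SD of the proportion of waves in center care
sd_matrix = 3.0      # hypothetical SD of matrix reasoning scores

effect_size = standardized_effect(b_center, sd_waves, sd_matrix)
```

With these placeholder SDs the result is about .06, which shows how a seemingly large raw coefficient can correspond to a very small standardized difference when the predictor varies over a narrow range.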
The second panel in Table 2 (Model 2) considered children’s 
intensity of exposure to center EEC (the hours per week children 
attended centers from infancy through preschool). Greater inten- 
sity of center EEC was associated with higher matrix reasoning 
skills, with a small effect size of .16 SDs but was not linked to 
children’s academic or vocabulary skills. Results from the third set 
of models assessing the role of center EEC timing are presented in 
the third panel (Model 3). These models found that children in 
center care during both infant/toddler as well as preschool years 
had higher matrix reasoning skills after school entry than did their 
peers who only attended center-based preschool (.11 SD, indicated 


by shared superscripts) or who were consistently in parent care 
(.20 SD). The timing of care was not significantly related to 
children’s academic or vocabulary skills. 


EEC Predicting Child Behavioral Skills at Age 7 


A parallel set of models was run predicting both parent reports 
and teacher reports of children’s behavioral skills at age 7, with 
results presented in Table 3. In the first set of models examining 
duration of center and informal EEC, greater 
exposure to center care from infancy through preschool was associated 
with lower parent-reported attention skills (.06 SD) as well 
as higher teacher-reported conduct problems (.06 SD) at age 7. 
As with children’s cognitive skills, neither the percentage of 
waves in informal EEC nor children’s exposure to multiple concurrent 
care arrangements was significantly associated with children’s 
later behavioral functioning, with one exception: greater 
exposure to informal care predicted lower parent-reported attention 
skills (.04 SD). 

Table 3 
Propensity Score Weighted OLS Regression Models With Extent, Intensity, and Patterns of EEC Predicting Child Behavioral 
Functioning at Age 7 

Independent variable              Parent attention  Parent conduct    Parent prosocial  Teacher attention Teacher conduct   Teacher prosocial 

Model 1: Duration of care 
% waves center care               −0.10 (0.04)*     0.04 (0.02)*      −0.05 (0.03)      −0.07 (0.05)      0.06 (0.03)*      −0.02 (0.04) 
% waves informal care             −0.09 (0.04)*     0.04 (0.03)       −0.03 (0.04)      −0.07 (0.05)      0.03 (0.03)       −0.02 (0.05) 
Covariates 
Multiple care all waves           −0.07 (0.07)      −0.02 (0.04)      0.04 (0.05)       −0.03 (0.07)      −0.03 (0.05)      0.04 (0.07) 
Multiple care some waves          −0.02 (0.02)      0.03 (0.01)*      −0.01 (0.02)      −0.04 (0.02)      0.01 (0.01)       −0.03 (0.02) 
Child age                         0.00 (0.00)       0.00 (0.00)       0.00 (0.00)       0.00 (0.00)       0.00 (0.00)       0.00 (0.00) 
Child male                        −0.18 (0.02)**    0.07 (0.01)**     [illegible]       −0.32 (0.02)**    0.09 (0.01)**     −0.24 (0.02)** 
Child low birth weight            −0.06 (0.04)      0.02 (0.02)       −0.01 (0.03)      −0.06 (0.04)      0.00 (0.02)       −0.01 (0.04) 
Child bad health                  −0.11 (0.05)*     0.03 (0.03)       −0.05 (0.03)      −0.02 (0.06)      0.02 (0.03)       −0.05 (0.05) 
Child temperament                 0.05 (0.01)**     −0.05 (0.01)**    0.05 (0.01)**     −0.04 (0.01)**    0.02 (0.01)*      −0.02 (0.01) 
Parent Asian                      0.04 (0.04)       −0.05 (0.02)†     −0.01 (0.03)      0.09 (0.04)*      −0.03 (0.02)      −0.03 (0.04) 
Parent Aboriginal                 −0.02 (0.05)      0.02 (0.03)       −0.03 (0.03)      −0.14 (0.06)*     0.10 (0.04)**     −0.12 (0.06)* 
Immigrant household               0.02 (0.02)       −0.03 (0.01)*     −0.02 (0.01)      −0.01 (0.02)      0.02 (0.01)       −0.02 (0.02) 
Non-English household             −0.02 (0.03)      0.04 (0.02)*      −0.02 (0.02)      −0.05 (0.03)      0.01 (0.02)       −0.03 (0.03) 
Wave 1 siblings                   0.01 (0.01)       0.01 (0.01)       −0.02 (0.01)*     −0.01 (0.01)      0.01 (0.01)       0.00 (0.01) 
Wave 2 change in siblings         0.05 (0.02)**     0.00 (0.01)       −0.01 (0.01)      0.08 (0.02)**     −0.04 (0.01)**    0.05 (0.02)** 
Wave 3 change in siblings         0.01 (0.02)       0.03 (0.01)**     −0.02 (0.01)      0.04 (0.02)*      −0.01 (0.01)      0.02 (0.02) 
Parent married all waves          0.07 (0.02)**     −0.05 (0.01)**    0.02 (0.02)       0.08 (0.03)*      −0.05 (0.02)*     0.05 (0.02)* 
Parent married some waves         0.03 (0.02)       −0.01 (0.02)      0.00 (0.02)       0.04 (0.04)       −0.02 (0.03)      0.05 (0.03)* 
Youngest parent’s age             0.00 (0.00)*      0.00 (0.00)       0.00 (0.00)       0.01 (0.00)**     0.00 (0.00)*      0.00 (0.00)* 
Parent < high school              −0.04 (0.06)      0.04 (0.04)       −0.03 (0.04)      −0.03 (0.07)      −0.02 (0.04)      −0.01 (0.05) 
Parent some college               −0.02 (0.04)      0.00 (0.03)       0.03 (0.03)       −0.02 (0.06)      0.00 (0.03)       −0.05 (0.04) 
Parent college/grad school        0.03 (0.04)       −0.03 (0.03)      0.01 (0.03)       0.01 (0.06)       −0.01 (0.03)      −0.06 (0.05) 
Wave 1 household income           0.10 (0.02)**     −0.05 (0.02)**    0.03 (0.02)       0.08 (0.02)**     −0.03 (0.02)*     0.07 (0.02)** 
Wave 2 change in income           0.07 (0.02)**     −0.04 (0.02)*     0.01 (0.02)       0.08 (0.03)*      −0.04 (0.02)*     0.03 (0.02) 
Wave 3 change in income           0.01 (0.03)       −0.01 (0.02)      0.00 (0.02)       −0.02 (0.03)      0.00 (0.01)       0.00 (0.02) 
Mother employed all waves         0.04 (0.02)       −0.04 (0.02)*     0.02 (0.02)       0.03 (0.03)       −0.02 (0.02)      0.01 (0.03) 
Mother employed some waves        0.02 (0.02)       −0.03 (0.01)*     0.03 (0.02)       0.05 (0.02)*      −0.02 (0.01)*     0.03 (0.02) 
Wave 2 cognitive stimulation      0.05 (0.02)**     −0.05 (0.01)**    0.09 (0.01)**     0.04 (0.02)†      −0.02 (0.01)      0.03 (0.02)* 
Wave 3 change in cognitive stim.  0.02 (0.02)       −0.03 (0.01)*     0.06 (0.02)**     −0.01 (0.02)      0.00 (0.01)       0.00 (0.02) 
F of model                        10.93             8.21              12.98             [illegible]       5.46              10.33 
R²                                0.09              0.08              0.10              0.15              0.07              0.10 

Model 2: Intensity of care 
Avg. hours in center care         −0.05 (0.01)**    0.02 (0.01)**     −0.02 (0.01)*     −0.05 (0.01)*     0.04 (0.01)**     −0.03 (0.01)* 
F of model                        10.23             7.79              12.16             16.31             5.52              9.10 
R²                                0.10              0.08              0.11              0.16              0.08              0.11 

Model 3: Patterns of care 
Early center plus preschool       −0.09 (0.05)      0.03 (0.03)       −0.02 (0.05)      −0.04 (0.05)      0.01 (0.03)ᵃ      −0.01 (0.05) 
Preschool only                    −0.06 (0.04)      0.01 (0.03)       0.01 (0.04)       −0.01 (0.05)      −0.01 (0.02)ᵃ     0.01 (0.05) 
Early center, no preschool        −0.11 (0.07)      0.07 (0.05)       −0.04 (0.05)      −0.04 (0.08)      −0.02 (0.04)      −0.04 (0.08) 
F of model                        2.83              1.90              2.67              4.08              [illegible]       2.55 
R²                                0.13              0.09              0.12              0.16              0.08              0.11 

Note. Models 2 and 3 included all covariates. Within rows matched superscripts indicate difference at p < .05. OLS = ordinary least squares; EEC = 
early education and care; Avg. = average. 

Results for the intensity of center EEC shown in the following 
panel found a much more consistent pattern in which greater hours 
spent in center care were associated with lower behavioral func- 
tioning across five of the six outcome measures, again with small 
effect sizes. A one-SD increment in average hours of center care 
predicted increased levels of both parent- and teacher-reported 
conduct problems (.06 SD and .12 SD); decreased parent- and 
teacher-reported attention skills (.09 SD and .08 SD); and de- 
creased teacher-reported prosocial behaviors (.06 SD). 

In relation to the timing of care, only one significant effect 
emerged, showing that children in center care during both infant/ 
toddler as well as preschool years had higher teacher-reported 
conduct problems after school entry than did their peers who only 
attended center EEC in preschool, with a small effect size of .07 
SDs. Neither of these groups differed significantly from children 
who never attended center EEC, although it is important to reit- 
erate that the no center care as well as the only early center care 
groups were very small, with limited statistical power to detect 
differences with the two larger groups of center preschool only and 
infant/toddler as well as preschool center care. 


Alternative Model Specifications 


A series of additional model specifications were estimated to 
test the robustness of the main effect results to a variety of 
concerns. First, models were rerun with the covariates entered 
separately at each wave and again with time-varying covariates 
averaged over Waves 1 through 3. Results were nearly identical to 
models with the covariates aggregated to highlight instability in 
children’s home contexts. Second, models were specified without 
the propensity score weights, including the full set of covariates, 
child lags, and sample weights. Results were very similar to those 
presented in Tables 2 and 3, with a few instances in which the OLS 
results were slightly stronger than the PSW results, suggesting that 
the propensity score weights helped to adjust for selection bias. 


Moderation by Family Socioeconomic Resources 


Following the main effects models we assessed whether 
associations between EEC experiences and children’s cognitive 
and behavioral skills differed for children from more versus less 
advantaged families. Moderation models were estimated for 
each of the three indicators of EEC experiences using first, a 
continuous measure of household income (averaged over 
Waves 1 through 3); second, categorical indicators of parents’ 
highest level of educational attainment; and third, a continuous 
measure of home cognitive stimulation (averaged over Waves 2 
and 3). We also assessed interactions with a dichotomous 
income measure delineating low-income children. Results 
(available upon request) did not show evidence suggesting that 
a greater duration, greater intensity, or different patterns of 


center EEC were more or less beneficial for children’s cognitive 
skills or behavioral functioning across family socioeconomic 
status. In each set of interactions, significant interactions oc- 
curred at or below the level expected by chance. 


Moderation by Child Temperament 


A final set of models considered whether the associations be- 
tween EEC experiences and children’s later functioning differed as 
a function of child temperament. These models were estimated 
using a continuous measure of child temperament at Wave 1. 
Results did not find that EEC experiences were linked to children’s 
later cognitive and academic skills differently for children with 
easier versus more challenging temperaments (interaction results 
available upon request). 


Discussion 


As governments across many countries increasingly promote 
accessible, affordable, and high quality EEC programs in order to 
support parental employment and prepare children for school, 
greater attention is being drawn to the repercussions of EEC 
experiences for children’s long-term cognitive and behavioral de- 
velopment. An extensive and robust body of research has assessed 
how attending EEC programs is predictive of children’s core 
cognitive and behavioral functioning following school entry; however, 
the vast majority of the findings derive from U.S. studies with 
little replication in other countries. As such, we have limited 
knowledge concerning other policy models of EEC and the gen- 
eralizability of EEC effects across different populations and in 
diverse policy and cultural environments. 

With a similar economic structure and use of both private and 
public provision of EEC as in the United States, Australia offers a 
context to replicate this research. In this article, we assessed links 
between center-based EEC and children’s cognitive skills and 
behavioral functioning in first grade in a nationally representative 
sample of Australian children, attending to the duration, intensity, 
and developmental timing of children’s experiences in center EEC 
programs. Incorporating numerous analytic techniques to adjust 
for selection bias, analyses found that greater duration and inten- 
sity of center EEC from infant through preschool years were linked 
with small enhancements in children’s nonverbal fluid intelligence 
skills according to direct assessments but were not associated with 
children’s vocabulary or academic (language, literacy, and math) 
skills. Greater duration and intensity of center EEC exposure were 
also predictive of small detriments to behavioral functioning, with 
children evincing lower attention skills, higher conduct problems, 
and lower prosocial behaviors according to both parent and teacher 
reports. Time in informal EEC arrangements as well as exposure to 
multiple concurrent EEC arrangements, in contrast, were not as- 
sociated with children’s later cognitive or behavioral functioning. 

An examination of the timing of center care experienced by 
children provided some evidence suggesting that the results were 
being driven by children who were in center care during infancy/ 
toddlerhood as well as preschool years; these children showed 
more advanced nonverbal fluid intelligence but also heightened 
conduct problems in first grade in comparison to their peers who 
had attended center programs only during preschool. Children who 
attended center-based EEC programs only during infancy/toddlerhood 
did not differ significantly from their peers, although we 
caution that this group was extremely small, as the majority of 
Australian children attend preschool programs. The limited statis- 
tical power from small sample sizes made it difficult to properly 
model the effect of not attending center-based preschool. 

Together, the combined results from the models assessing the 
duration, intensity, and timing of center EEC suggest that an 
accumulation of center care over both earlier and later years prior 
to school entry as well as higher hours of center care were driving 
both the enhancements in fluid intelligence as well as the detri- 
ments in behavioral and social skills found in children as they 
entered middle childhood. Prior to discussing the implications of 
these results, it is important to note limitations. Data limitations 
that were inherent in our reliance on panel data from the LSAC 
included incomplete information on all care settings attended and 
information on EEC only at distinct developmental periods rather 
than a continuous accounting of EEC use from birth through 
kindergarten entry. In addition, we were not able to assess the 
quality of children’s EEC experiences. Although EEC quality is an 
essential concern of policy makers and practitioners, it is important 
to reiterate that a number of studies have found the type and 
duration of EEC children experience to be stronger predictors of 
children’s functioning than EEC quality as assessed using current 
measures (Coley, Lombardi, et al., 2013; McCartney et al., 2010; 
NICHD ECCRN, 2003, 2006; Peisner-Feinberg et al., 2001; 
Votruba-Drzal et al., 2013). Moreover, scholars are questioning 
the validity of many existing measures of EEC quality (Gordon, 
Fujimoto, Kaestner, Korenman, & Abner, 2013; Sabol, Soliday 
Hong, Pianta, & Burchinal, 2013), highlighting the need for mea- 
surement development and a reexamination of components of EEC 
experiences most promotive of children’s successful development. 
Finally, we reiterate that these data were correlational, precluding 
us from drawing truly causal inferences; yet the results emerged 
controlling for a wide range of child, parent, and family charac- 
teristics as well as earlier child cognitive/behavioral functioning, 
factors that may influence both children’s differential selection 
into EEC as well as their later functioning. Moreover, our models 
incorporated propensity score weighting (PSW) techniques de- 
signed to better control for preexisting differences in children and 
families and hence identify less biased connections between EEC 
experiences and children’s later functioning (Imbens, 2000). 

It is also important to consider the size and practical significance 
of the findings. The effect sizes of center-based EEC duration, 
intensity, and timing on children’s cognitive skills and behaviors 
were consistently small, ranging from .06 to .20 SDs. These effect 
sizes are relatively comparable to those reported in much of the 
EEC research from the United States, which are most often in the 
.10 to .20 SD range (Coley, Lombardi, et al., 2013; Loeb et al., 
2007), although many of the U.S. analyses assessed child func- 
tioning during kindergarten, a shorter lag time than used in the 
current research assessing children’s functioning in first grade. 
One manner of assessing the importance of such effect sizes is to 
compare them to effects of other variables. In the current research, 
for example, effects of EEC were of a similar size to the difference 
between parents with a college degree versus a high school degree, 
which was associated with a .25 SD shift in children’s matrix 
reasoning skills and shifts of .07 SD and .04 SD in parent reports 
of attention skills and teacher reports of conduct problems, respec- 
tively, in the multivariate models adjusting for EEC and other child 


and family covariates. EEC effects were also similar in size to a 
1.0 SD increment in the cognitive stimulation provided in chil- 
dren’s home environments, which predicted increments of .04 to 
.06 SDs in children’s cognitive and behavioral skills. 

Another mechanism for considering the practical importance of 
findings is to address the probability of longer-term repercussions. 
Recent long-term follow-ups of EEC effects in the United States 
have reported evidence that negative effects of center-based EEC 
on behavior problems, although small, remained relatively stable 
through elementary school (Belsky et al., 2007). Other work has 
suggested that although measurable effects on achievement may 
fade out during middle childhood, long-term benefits reemerge in 
adulthood (Deming, 2009), suggesting that even small effects on 
cognitive and behavioral skills in early childhood may have long- 
term consequences for children. Scholars have hypothesized that 
such “sleeper” effects may be due to noncognitive skills such as 
task persistence or social skills (Chetty et al., 2011; Deming, 2009) 
that in turn affect long-term educational and economic outcomes. 

Together, results from this study replicate and extend prior 
research from the United States arguing that extensive center- 
based EEC may provide both benefits (to fluid intelligence skills) 
and risks (to behavioral skills) for children’s later development 
(e.g., Coley, Lombardi, et al., 2013; Magnuson et al., 2007; 
NICHD ECCRN, 2003; Phillips et al., 2006; Votruba-Drzal et al., 
2013). Our results extend this literature by carefully attending to 
duration, intensity, and timing of center-based care, finding that it 
is an accumulation of center exposure over developmental periods 
and with greater intensity that matters most for both cognitive 
skills and behavioral functioning. Also replicating U.S. research, 
exposure to more informal and home-based care settings was not 
significantly associated with children’s later functioning, nor was 
use of multiple concurrent EEC arrangements. 

Hence, one central message from this research concerns the 
replication of the general pattern of associations between center 
EEC and children’s functioning, replication that is notable given 
the differences in EEC policy and accessibility. Though Australia 
has greater regulations of EEC quality standards and offers more 
direct and subsidized funding for EEC, center child care appears to 
hold generally similar repercussions for both Australian and Amer- 
ican children. Together, results from this work echo recent calls for 
the need to further delineate mechanisms and to develop and 
replicate center-based EEC models that best support children’s 
cognitive skills while also promoting successful behavioral devel- 
opment. Research assessing publicly supported American EEC 
programs, including public pre-K programs (Gormley et al., 2011) 
as well as Head Start (U.S. Department of Health and Human 
Services, Administration for Children and Families, 2010), has 
found that socioemotional functioning is not compromised when 
children attend high quality school-based pre-K programs. A va- 
riety of potential mechanisms have been hypothesized to explain 
these findings, including high overall program quality, high levels 
of teacher education and teacher salaries, as well as extensive 
attention to classroom management (Gormley et al., 2011; see also 
Raver et al., 2009). The Australian government recently committed 
to making access to part-time center-based preschool with quali- 
fied teachers universal and to implement a nationally regulated 
quality assessment system (Council of Australian Governments, 
2014). These changes may offer an opportunity to further study 
and identify promising EEC practices to best support children’s
healthy development. 

Results from this study also provide some points of divergence 
from studies with American children. Most notably, links between 
EEC and children’s cognitive skills were narrower than has been 
shown in research with American samples: Here, results emerged 
only for children’s nonverbal fluid intelligence and not their re- 
ceptive vocabulary or general academic skills (which included 
language, literacy, and math skills). Much of the prior U.S. re- 
search has considered direct assessments of children’s math and 
reading skills rather than relying on teacher reports as the LSAC 
did, which may be less valid and reliable (e.g., Duncan & NICHD 
ECCRN, 2003; Gormley et al., 2005; Loeb et al., 2007; Magnuson 
et al., 2004; Votruba-Drzal et al., 2013). 

The nonsignificant associations between EEC and children’s 
language, literacy, and math skills in this sample also may be 
related to a second arena of divergence with results from American 
samples: the lack of differential associations between EEC and 
children’s skills as a function of family socioeconomic status. One 
possible explanation for the small or neutral effects of EEC on 
children’s cognitive and behavioral skills is that these small aver- 
age effects are hiding significant heterogeneity. For example, prior 
work from the United States has found that center EEC is signif- 
icantly beneficial for cognitive skills development in children of 
parents with lower incomes, less education, and lower provision of 
cognitive stimulation in their home environments while showing 
neutral or even slightly negative effects on the cognitive skills of 
children from more advantaged families (Loeb et al., 2004, 2007; 
McCartney et al., 2007; Votruba-Drzal et al., 2013; but see Belsky 
et al., 2007 and other NICHD work that has not found income 
moderation). In the LSAC sample, in contrast, we found no evi- 
dence for moderation of EEC effects by family income, parent 
education, or home cognitive stimulation, suggesting that the small 
benefits for children’s fluid intelligence and risks for children’s 
behavioral skills derived from greater exposure to center EEC 
were shared broadly across Australian children. 

What might explain these different patterns of results? Recent 
cross-national work has argued that income inequality and poverty 
are notably lower in Australia than in the United States and further 
that differentials in child functioning related to family income, 
particularly differentials in cognitive skills, are also lower in
Australia (Bradbury, Corak, Waldfogel, & Washbrook, 2012). If there
are narrower gaps in children’s socioeconomic resources as well as 
narrower gaps in children’s cognitive skills related to socioeco- 
nomic resources in Australia, then EEC in Australia will have a 
much weaker potential to help close such gaps. This line of 
reasoning helps to explain both the limited total effects of EEC on 
children’s language and academic skills in the LSAC sample, as 
well as the lack of EEC by SES moderation. We reiterate the lack 
of moderation by children’s temperament as well, furthering the 
argument that the small effects unearthed in this study were shared 
broadly across subgroups of children. 

In conclusion, our analysis sought to replicate and extend prior 
EEC research by incorporating sophisticated modeling techniques 
to help adjust for selection bias in delineating prospective associ- 
ations between EEC and children’s functioning among children in 
Australia, a country with higher EEC attendance and quality stan- 
dards, and a more homogenous population than the United States. 
Our results largely replicated prior results with American samples
of children, suggesting that greater duration and intensity of center
EEC from infancy through preschool is linked with small advan- 
tages in nonverbal fluid intelligence skills and small detriments in 
behavioral skills for Australian children. These effects were not 
found to be differentiated by family socioeconomic characteristics
or child temperament, suggesting that EEC has small average
effects for Australian children. As Australia and the United States 
both seek to expand access to high-quality EEC programs, results 
from this work add to the growing literature suggesting the need to 
develop and replicate center-based EEC models that best support 
children’s cognitive skills while also promoting successful behav- 
ioral development. 
Appendix
Estimating Each Child’s Propensity to Be in Center EEC

                                Duration of care    Intensity of care   Patterns of care
                                % waves             Avg. hours in       Early center,       Preschool           Early center plus
Variable                        center care         center care         no preschool        only                preschool

Wave 1 predictors
Child age                       0.004 (0.001)**     0.009 (0.004)*      0.116 (0.053)*      −0.061 (0.029)*     −0.007 (0.028)
Child male                      −0.003 (0.008)      −0.020 (0.022)      0.010 (0.275)       0.189 (0.148)       0.109 (0.151)
Child low birth weight          −0.011 (0.017)      0.024 (0.047)       −0.722 (0.742)      0.019 (0.325)       −0.135 (0.315)
Child bad health                0.059 (0.027)*      0.164 (0.076)*      −0.338 (1.059)      −0.022 (0.495)      0.289 (0.497)
Child cognitive skills          −0.002 (0.000)**    −0.008 (0.001)**    0.000 (0.016)       0.001 (0.009)       −0.016 (0.009)*
Child temperament               0.000 (0.007)       0.022 (0.020)       0.156 (0.235)       0.085 (0.135)       0.076 (0.134)
Parent Asian                    −0.059 (0.017)**    −0.094 (0.05)*      −0.237 (0.744)      −0.105 (0.325)      −0.401 (0.314)
Parent Aboriginal               −0.054 (0.021)*     0.010 (0.066)       0.148 (0.473)       −0.665 (0.285)*     −0.650 (0.268)*
Immigrant household             0.014 (0.010)       0.076 (0.031)*      −0.081 (0.378)      −0.096 (0.194)      0.014 (0.186)
Non-English household           −0.073 (0.015)**    −0.092 (0.046)*     −0.344 (0.598)      −0.286 (0.252)      −0.751 (0.249)**
Child number siblings           −0.039 (0.004)**    −0.095 (0.010)**    −0.177 (0.168)      −0.339 (0.065)**    −0.515 (0.062)
Parent married                  0.029 (0.010)**     −0.029 (0.030)      −0.344 (0.354)      0.480 (0.198)*      0.502 (0.192)*
Youngest parent’s age           0.002 (0.001)*      0.005 (0.003)*      −0.030 (0.039)      0.025 (0.018)       0.013 (0.018)
Parent < high school education  −0.035 (0.027)      −0.118 (0.075)      −0.664 (0.860)      −0.327 (0.394)      −0.394 (0.418)
Parent some college             0.027 (0.021)       0.041 (0.059)       0.003 (0.826)       −0.206 (0.327)      0.114 (0.335)
Parent college/grad school      0.053 (0.021)*      0.128 (0.059)*      −0.022 (0.811)      0.029 (0.323)       0.403 (0.325)
Household income                0.000 (0.000)**     0.000 (0.000)**     0.000 (0.000)       0.000 (0.000)**     0.000 (0.000)**
Mother employment hours         0.004 (0.000)**     0.020 (0.001)**     0.004 (0.014)       −0.002 (0.008)      0.016 (0.008)*
F of model                      28.75               34.35               5.72                5.72                5.72
R²/Pseudo R²                    0.12                0.17
Range of propensity scores      0.073-0.861         0.055-2.660         0.001-0.289         0.164-0.746         0.059-0.812

Note. EEC = early education and care; Avg. = average.
* p < .05. ** p < .01.


Received April 24, 2013 
Revision received June 11, 2014 
Accepted June 17, 2014


Journal of Educational Psychology 


2015, Vol. 107, No. 1, 300-308 


FernUniversität in Hagen


“He Who Can, Does; He Who Cannot, Teaches?”:
Stereotype Threat and Preservice Teachers 
Stereotype threat is defined as a situational threat that diminishes performance, originating from a
negative stereotype about one’s own social group. In 3 studies, we seek to determine whether there are 
indeed negative stereotypes of students who have chosen a career in teaching, and whether the 
performance of these students is affected by stereotype threat. Responses to open-ended questions (Study 
1, N = 82) and comparisons in closed-ended response format (Study 2, N = 120) showed that preservice 
teachers are perceived as having a low level of competence and a high level of warmth, in keeping with 
the paternalistic stereotype. We conclude that a stereotype does indeed exist that attributes lower 
competence to prospective teachers. In Study 3 (N = 262), a group of preservice teachers was subjected
to stereotype threat. In keeping with the stereotype threat model, that group performed worse on a 
cognitive test than the group of similar students who were not under stereotype threat; the performance 
of students in the field of psychology did not differ in response to the threat condition. This study is the 1st
to show the effects of stereotype threat on students preparing for a teaching career. 
The term stereotype threat refers to a situational threat that
diminishes performance, originating from a negative stereotype 
about one’s own social group (Steele, 1997; Steele & Aronson, 
1995). Situational pressure results when members of a group find 
themselves in a situation that is associated with a negative stereo- 
type of that group, and they are anxious about confirming the 
stereotype or being judged by it. This leads to an emotional 
response that impairs the individual’s cognitive functioning and 
performance (Schmader, Johns, & Forbes, 2008). Considerable 
evidence has shown that female students perform less well than 
their male counterparts in mathematics and related fields. Girls 
who are reminded of their gender before taking a math test worry 
about confirming the stereotype that women are less capable than 
men in math and science. These negative feelings and thoughts
consume some of the cognitive resources necessary for performing 
well on the test, leading to results that are worse than they would 
be otherwise (Huguet & Régner, 2007; Ihme & Mauch, 2007; 
Jordan & Lovett, 2007; Keller, 2007; Keller & Dauenheimer, 
2003; Muzzatti & Agnoli, 2007, among others). In addition to 
lowering school achievement, stereotype threat interferes with 
learning itself (Rydell, Rydell, & Boucher, 2010; Rydell, Shiffrin, 
Boucher, Van Loo, & Rydell, 2010) and weakens identification 
with the affected domain or group (e.g., Woodcock, Hernandez,
Estrada, & Schultz, 2012). 

Ethnic stereotypes, too, have been the subject of considerable 
study. McKown and Weinstein (2003) found that the concentration 
and working memory of African American and Latino students 
declined after they were reminded of the stereotype of these groups 
as intellectually inferior. Désert, Préaux, and Jund (2009) have 
shown how stereotype threat affects the performance of children 
from socially disadvantaged families on intelligence tests. Stereo- 
type threat is an issue not only for students but also for individuals 
in the labor force. von Hippel, Kalokerinos, and Henry (2013), for 
example, have found that older workers who face age-related 
stereotypes are less satisfied with their jobs and more likely to quit. 

Because negative stereotypes are associated with certain types 
of jobs, stereotype threat may play a role in that context as well. 
However, unlike other social groups (such as those related to 
gender or ethnicity), jobs are usually chosen. As a result, it is 
possible to leave profession-related groups. On the one hand, 
people might therefore regard their membership in these groups as 
less binding and less significant, and this might make them less 
vulnerable or even immune to stereotype threat. On the other hand, 
people might regard their membership as binding and significant 
precisely because of their choice (Fisher & Andrews, 1976), and 
this might make them vulnerable to stereotype threat. 

The focus of our studies is to determine whether stereotype 
threat affects students training for a career in teaching. This is an 
important question, because stereotype threat may have a detri- 
mental effect on learning and performance (Rydell et al., 2010; 
Taylor & Walton, 2011) and ultimately lead to disidentification 
with the field of study (and thus also the teaching profession), 
possibly even causing students to drop out (see Milner & Woolfolk 
Hoy, 2003, for a discussion of the possible effects of stereotype 
threat on African American teachers). 
Stereotype Threat in the Teaching Profession and
Teacher Training 


Teachers and preservice teachers are subject to considerable 
negative stereotyping (Blömeke, 2005; Spinath, van Ophuysen, &
Heise, 2005; Swetnam, 1992). George Bernard Shaw’s comment 
that “he who can, does; he who cannot, teaches” reflects an attitude 
toward the teaching profession that is still common today (teach- 
ing, 2013). It is a view that is often found in the media as well 
(Blömeke, 2005; Swetnam, 1992). Even preservice teachers
themselves seem to share these negative perceptions; a study by
Carlsson and Bjorklund (2010) has shown that they, too, view 
preschool teachers as considerably less competent than lawyers 
(though high in warmth). 

It therefore appears that teachers, more than other occupational 
groups, are viewed as less competent, and they are confronted with 
these negative stereotypes even during their training. It is widely 
believed that preservice teachers are weaker than other students in 
the qualities needed to earn an academic degree (such as intelli- 
gence and achievement motivation; Spinath et al., 2005). In their 
comparative study Spinath et al. (2005) looked at various fields of 
study (education science, economics, mathematics, natural sci- 
ences, and engineering, as well as teacher training programs 
geared to a variety of school types), finding no difference between 
the preservice teachers and other students in terms of intelligence, 
achievement motivation, or reading skills. However, Spinath and 
colleagues (2005) also concluded that despite the lack of real 
differences in cognitive ability, the very existence of these stereo- 
types can have a negative effect. Yet, no systematic studies have 
investigated the incidence and nature of stereotypes associated 
with preservice teachers or looked at the effects of stereotype 
threat on this population. 

Our studies are intended to show whether preservice teachers 
believe that others view them as less competent and whether 
that stereotype leaves them vulnerable to stereotype threat. Our 
first step was to conduct two exploratory studies (Studies 1 and 
2) to determine the general salience of these stereotypes. On the 
basis of the stereotype content model (Fiske, Cuddy, Glick, & 
Xu, 2002; Fiske, Xu, Cuddy, & Glick, 1999), we examined 
whether preservice teachers (Study 1) and other persons (Study 
2) believe in the existence of the oft-cited stereotype that 
preservice teachers are less capable. Often research on stereo- 
type threat simply assumes the existence of the investigated 
stereotype. This may be justified in certain cases (such as 
gender stereotypes on performance in mathematics, for exam- 
ple) but needs confirmation in others. The more a stereotype is 
widely held and consensual, the more likely it is to be chroni- 
cally accessible: Widely held stereotypes may require only 
subtle hints to become salient (such as TV spots not even 
directly mentioning or showing the stereotype in question— 
e.g., in case of gender stereotypes). That is why we conducted 
two studies using different methods and inviting different par- 
ticipants. 

Of central importance is our experimental study (Study 3) in 
which preservice teachers were confronted with this stereotype 
when taking a cognitive test. Their results were compared with 
those of other preservice teachers who were not subjected to 
stereotype threat, as well as with a control group. 


Study 1 


In the first study, consisting of open-ended questions, preservice 
teachers were asked to describe the stereotypes of their ingroup. 
Their responses were categorized according to the stereotype con- 
tent model (Fiske et al., 1999, 2002), which classifies social 
stereotypes in terms of two dimensions—competence and 
warmth—using a four-quadrant matrix. Various combinations of 
these two qualities result in four stereotypes: The paternalistic 
stereotype is usually associated with individuals of lower status 
(housewives, older people) who do not represent competition for 
social resources; these individuals are perceived to have a low 
level of competence and a high level of warmth. “Admiration” 
attributes high scores on both dimensions to the respective groups 
(e.g., the ingroup). Groups that fit the contemptuous stereotype 
(e.g., the homeless) are viewed as incompetent and cold, whereas 
the envious stereotype applies to groups regarded as competent, 
but cold (e.g., the rich). 

Given what we know about the negative stereotyping of preser- 
vice teachers and the stereotypes found in the literature, we would 
expect these students to fit the paternalistic stereotype most close- 
ly—in other words, to be viewed as low on competence but high 
on warmth. 


Method 


Sample. The sample in Study 1 consisted of students at a university in
northern Germany (N = 82) who were enrolled in a master’s program in
teaching (academic track Gymnasium, M = 9 semesters, SD = 1.78). They
were asked about stereotypes of their group. The average age of the
respondents was M = 26 (SD =
4.58); 76.8% of them were women. Respondents were recruited 
through university courses. 

To clarify the background of the preservice teacher samples in 
Study 1 and Study 3 (the sample in Study 2 consisted of students 
from other fields of study as well as participants from different 
nonuniversity backgrounds), we summarize a typical teacher train- 
ing program in Germany. In most German federal states, teacher 
training at university is composed of a 3-year bachelor’s (B.A./ 
B.Sc.) program and a 2-year master’s (M.Ed.) program, including 
the academic study of two scientific disciplines and didactics for 
the corresponding school subjects. Additionally, students take a 
variety of courses in educational sciences. During their time at the 
university, they attend the same courses as students not training to 
be teachers (e.g., preservice mathematics teachers attend together 
with students who want to become mathematicians). After com- 
pleting the master’s program, they start their on-the-job training in 
schools (Referendariat). After 1–2 years of training, they become
fully qualified teachers.

Procedure. The survey was conducted at the beginning of 
these courses. The students were given 5 min to answer the 
following question in a few key words: “In your opinion, what 
characteristics do other students ascribe to the ‘typical student’ 
preparing for a career in teaching?” 

Other studies have used similar open-ended questions to gener- 
ate stereotypes (e.g., Devine, 1989). In this case, the purpose was 
to determine how preservice teachers believe they are perceived by 
other students. The sample was limited to students preparing for a 
teaching career, because stereotype threat will only have an effect 
if participants are aware of a negative stereotype and assume that 
they are being judged by that stereotype (Steele, 1997). We chose
an open-ended format so that respondents could freely generate 
stereotypes. 

Analysis. Study 1 recorded N = 398 statements (average M = 
4.84 statements per person). Two student reviewers categorized 
responses on the basis of the two dimensions of the stereotype 
content model (competence, warmth). They used two five-step 
scales to rate each response in terms of competence (ranging from
−2 = indicative of low competence to 0 = neutral to +2 =
indicative of high competence) and warmth (ranging from −2 =
indicative of low warmth to 0 = neutral to +2 = indicative of high
warmth). Responses such as “incapable” or “not suited to aca- 
demic studies” were consistently classified as indicating incompe- 
tence. Responses like “vain” or “politically on the left” were 
considered to be neutral. Descriptions such as “capable” and 
“striving” were viewed as indicative of competence. 

To confirm that the two dimensions of competence and warmth 
formed the basis of reviewers’ categorizations, we followed a 
three-step procedure. First, we randomly split the sample of state- 
ments. Second, a factor analysis of the four categorizations (one 
supposed competence and one supposed warmth rating per re- 
viewer) made by the reviewers was conducted on one half the 
sample. The analysis resulted in two clearly distinguishable factors 
(all factor loadings > |.73|) explaining 87.20% of the variance.
Both competence ratings formed one factor, whereas both warmth 
ratings formed the other one. Third, a confirmatory factor analysis
of both reviewers’ categorizations on the competence and the
warmth dimension (using the other half of the sample) produced an
acceptable model fit, χ²(1) = 1.78, p < .25, root-mean-square
error of approximation = .06, comparative fit index = 1.00, thus
confirming the two-dimensional structure.


Results 


We analyzed the mean levels of competence and warmth on the 
basis of the reviewers’ ratings. If the two student reviewers dis- 
agreed, a third reviewer was asked to render a final decision that 
was included in the mean score. Because we applied two t tests to
investigate the mean levels of competence and warmth attributed 
to preservice teachers, the Bonferroni method used to correct the 
alpha level resulted in an alpha level of .025. The mean level of 
competence attributed to these students was M = —.38 (SD = 
.96), a value significantly below the midpoint of the scale, 
t(397) = −7.86, p < .001, d = .40. The mean level of warmth was
M = .29 (SD = .97). This value, too, deviated from the midpoint of
the scale, in this case in a positive direction, t(397) = 6.07,
p < .001, d = .30. Thus, preservice teachers were perceived to
have a low level of competence and a high level of warmth. 
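
As a rough check on these figures, the test statistics and effect sizes can be approximated from the reported summary values alone. The sketch below is illustrative only: the function name is hypothetical, the computation assumes a one-sample t test against the scale midpoint of 0, and because it starts from rounded means and standard deviations its output need not match the reported t values exactly.

```python
import math


def one_sample_t_from_summary(mean, sd, n, mu0=0.0):
    """Return (t, Cohen's d, df) for a one-sample t test computed from summary statistics."""
    standard_error = sd / math.sqrt(n)
    t = (mean - mu0) / standard_error
    d = abs(mean - mu0) / sd
    return t, d, n - 1


# Bonferroni correction for the two tests reported in the text.
bonferroni_alpha = 0.05 / 2

# Summary values as reported for the N = 398 statements.
t_competence, d_competence, df = one_sample_t_from_summary(-0.38, 0.96, 398)
t_warmth, d_warmth, _ = one_sample_t_from_summary(0.29, 0.97, 398)

print(f"competence: t({df}) = {t_competence:.2f}, d = {d_competence:.2f}")
print(f"warmth:     t({df}) = {t_warmth:.2f}, d = {d_warmth:.2f}")
print(f"corrected alpha = {bonferroni_alpha}")
```

The effect sizes recovered this way round to the reported values of .40 and .30; the t statistics come out close to, but not identical with, the reported ones, which is consistent with rounding in the published summary statistics.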


Discussion 


The first study consisted of an open-ended survey aimed at 
identifying the stereotypes preservice teachers believe others at- 
tach to them. The stereotype content model was used to analyze 
their responses. Results showed that these students are aware that 
others view them as less competent, a negative stereotype that was 
identified by Spinath and colleagues (2005). Overall, the first 
study painted a picture of a less competent but sociable group, in 
keeping with the paternalistic stereotype. 


However, because Study 1 was a qualitative survey of a very 
selective sample of members of the potentially stereotyped group,
it is not able to shed light on the salience of the stereotype, nor 
does this kind of survey reveal just how negative these stereotypes 
are, relative to other groups. We therefore conducted another study 
with a larger sample, which allowed us to compare the group of 
preservice teachers with other groups of students in an effort to 
find additional empirical evidence of a stereotype that preservice 
teachers are less competent. This study was intended to determine 
the general salience of this negative stereotype and its relative 
level of negativity. 


Study 2 


The second study was conducted online, and various groups 
(students, employed individuals, etc.) were asked to describe the 
characteristics of students in several fields. The survey was de- 
signed in accordance with the stereotype content model (Fiske et 
al., 1999, 2002). We used a method established by Fiske et al. 
(1999) to assess these groups’ competence and warmth (see be- 
low). 


Method 


Sample. The analysis included data on N = 120 individuals. 
The average age of the respondents was M = 29 (SD = 9.92; 
range = 16–62 years); 73.3% of them were women. The majority
were university students (n = 84) of various fields of study. The 
remaining participants were either members of the workforce (n = 
22), people without employment (n = 11), or pupils (n = 3). In 
keeping with Reips’ (2002) recommendations for online research, 
respondents were included in the analysis only if it was clear that 
they had completed the survey without interruption. Thus, n = 3
individuals were excluded prior to the analysis.

Procedure. To test the assumption that preservice teachers fit 
the paternalistic stereotype, we looked at students in four disci- 
plines (teacher training, law, computer science, and psychology) in 
light of the dimensions of the stereotype content model, analogous 
to the work of Fiske et al. (1999). The respondents were given a 
list of the 27 characteristics used by Fiske and colleagues (1999), 
translated into German, for the purpose of identifying levels of 
competence and warmth. 

Respondents were instructed as follows: 


The purpose of this survey is to determine how students in four 
different fields (teacher training, law, computer science, and psychol- 
ogy) are perceived. We are not interested in your personal beliefs, but 
in how you think these groups are viewed by others. 


This wording was modeled after Fiske et al. (1999) and intended 
to prevent respondents from answering in a way they considered 
socially acceptable. Respondents indicated the degree to which 
each characteristic applied to the respective group, using a 5-point 
Likert scale ranging from 0 (not at all) to 4 (extremely); the 
characteristics were listed in random order. On the basis of these 
responses, we estimated the degree to which each characteristic 
applied to students in the four groups. 

Analysis. The first steps of our analysis follow Fiske et al. 
(1999). A principal components analysis of the 27 characteristics 
was conducted for each group (teaching, law, computer science, 
and psychology students), for a total of four analyses. Out of the
resulting factors (oblique rotation), we identified the ones on which
the items “competent” (translated as “kompetent”) and “likable”
(translated as “sympathisch”) loaded the highest (factor loadings
of over |.40|). Then the characteristics were identified that
consistently, for all four groups, loaded on the same factor (once again,
factor loadings of over |.40|), either with “competent” or with
“likable.” In contrast to Fiske et al. (1999), we included only the 
characteristics that were present in the factor solutions for all of the 
groups. The item “competent” loaded with “industrious,” “intelli- 
gent,” and “determined.” The item “likable” loaded with “helpful,” 
“sincere,” “warm,” and “kind” (see Table 1). The resulting scales 
were sufficiently reliable (.80 ≤ α ≤ .85).
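
For readers unfamiliar with the reliability coefficient, the following sketch shows how Cronbach's alpha is obtained from item-level ratings. The function name and the ratings are hypothetical and serve only to illustrate the formula; they are not the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item ratings."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of two items by four respondents on the 0-4 scale.
ratings = [[0, 1], [1, 0], [2, 2], [3, 3]]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

Alpha approaches 1 as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency for a scale.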

We subsequently conducted within-subject analyses of variance 
(ANOVAs), comparing scale values for competence and warmth 
(see Table 2) for the four groups. Our central hypothesis suggests 
that preservice teachers are regarded as less competent than other 
students. 


Results 


The results show that competence ratings differ between differ- 
ent fields of study, F(3, 357) = 52.55, p < .001, d = 1.33. Paired 
comparisons (applying the Bonferroni method) were made to 
examine which pairs of means differed. Preservice teachers were 
perceived to be significantly less competent than the other groups 
(all ps < .001). The other groups (psychology, law, and computer 
science students) showed no differences in perceived competence 
(all ps > .07). 

The results show that warmth ratings differ between different 
fields of study, F(2.8, 336.4) = 132.78, p < .001, d = 2.10.¹
Again, paired comparisons were made to examine which pairs of 
means differed. Preservice teachers were seen as scoring signifi- 
cantly higher on warmth than law or computer science students 
(ps < .001), but this was not the case when they were compared 
with psychology students (p = .65). Psychology students, too, 
were seen as warmer than law or computer science students (ps < 
.001). Furthermore, law students were considered to be signifi- 
cantly less warm than computer science students (p < .001). 

A final t test for dependent samples showed that preservice 
teachers were regarded as significantly less competent than warm, 
t(119) = −7.41, p < .001, d = .72. From the perspective of the
stereotype content model, they therefore fit the paternalistic ste- 
reotype, with low competence and high likability. 





Table 1 

Traits for Study 2 

Competent Warm Arrogant Determined 
Likable Gullible Industrious Tolerant 
Helpful Confident Gentle Complaining 
Spineless Dictatorial Intelligent Irritable 
Sincere Competitive Good-natured Egoistical 
Cold Independent Kind Passive 
Hostile Whiny Greedy 


Note. These traits were adapted from Fiske, Cuddy, and Glick (1999) for 
the purpose of the present research and translated into German. 


Table 2
Study 2: Means (and Standard Deviations) for Competence and
Warmth of the Assessed Fields of Study

Dimension      Field of study      M (SD)

Competence     Teacher training    2.02 (.72)
               Law                 2.90 (.71)
               Computer science    2.75 (.61)
               Psychology          2.73 (.74)

Warmth         Teacher training    2.60 (.58)
               Law                 1.39 (.60)
               Computer science    1.84 (.57)
               Psychology          2.50 (.65)

Discussion 


The open-ended questions posed in Study 1 revealed that per- 
ceptions of preservice teachers corresponded to the paternalistic 
stereotype. In Study 2, we compared those stereotypes with ste- 
reotypes of students in other fields. As expected, perceptions of 
preservice teachers conformed to the paternalistic stereotype pro- 
posed by the stereotype content model, with low scores for com- 
petence and higher ones for warmth. This is in keeping with the 
findings of Carlsson and Björklund (2010) in their study of preschool
teachers. Our findings also confirmed the existence of a
negative stereotype related to the competence of preservice teach- 
ers, as the literature suggests. The next question was whether the 
performance of these students is adversely affected when that 
stereotype is made salient. Study 3 was devoted to answering that 
question. 


Study 3 


In Study 3, we tested the hypothesis that competence-related 
stereotype threat leads to weaker performance by preservice teach- 
ers: Preservice teachers who are subjected to stereotype threat 
perform less well on a test of cognitive ability than preservice 
teachers who are not. 


Method 


Sample. Test participants included N = 262 preservice teach- 
ers (academic track Gymnasium, n = 134) and psychology stu- 
dents (n = 128) at a university in northern Germany (M = 23.69 
years of age, SD = 4.35; 72.9% female) who were recruited 
through university courses. The psychology students were in- 
cluded as a control group. 

Independent variables. The independent variables were the
test condition (IV1: stereotype threat present vs. not present)
and the academic field of the test subjects (IV2: teaching vs.
psychology). Participants were randomly assigned to the respective
IV1 condition. Stereotype threat was conveyed through the test
instructions. We consciously chose a moderately explicit approach
(Nguyen & Ryan, 2008), activating the different fields of study as
a social category without explicitly mentioning any expectation of
weaker performance, for three reasons: First, in everyday life,
stereotype threat is generally activated almost unnoticeably
(Steele, 2010). Second, as numerous studies have shown, activating
stereotype threat does not require a direct link to the relevant
stereotype (Davies, Spencer, Quinn, & Gerhardstein, 2002; Ihme &
Mauch, 2007); however, the stereotype itself must be sufficiently
salient. Third, we wanted to rule out the possibility of stereotype
reactance, which occurs when individuals feel challenged by the
explicit activation of a negative stereotype and respond by acting
in a way that is contrary to the stereotype (Kray, Thompson, &
Galinsky, 2001). A more subtle approach to activating the
stereotype is intended to prevent that effect.

¹ Because the Mauchly test showed a significant result, the
Greenhouse-Geisser estimator was applied.

Psychology was selected as the comparison discipline based 
on the finding of Study 2 that psychology students and preser- 
vice teachers differ only in the stereotype content model’s 
competence dimension and not in warmth. Psychology students, 
although similarly stereotyped in the warmth dimension and 
similar to preservice teachers in other characteristics (e.g., 
gender distribution), constitute a different group that is not
targeted by the performance-related stereotype. When group
membership is made salient, stereotype threat should affect the
performance of preservice teachers only. In stereotype threat
research, the choice of the comparison group is most often a
trivial point. When the threatened group is an ethnic minority,
the comparison group is often the ethnic majority; when the
threatened group consists of female participants, the comparison
group consists of male participants, and so on. However, in the
case of different fields of study, the answer is less
straightforward. In the end, we decided on a rather similar
comparison group so that any effect we found could be attributed
to the one particular stereotype that we targeted.

Dependent variables. To test performance, we used the ma- 
trix subtest from the Intelligence Structure Test 2000R (Liepmann, 
Beauducel, Brocke, & Amthauer, 2007). It was chosen because it 
does not include mathematics or verbal items, which might acti- 
vate subject-specific, school-related aspects of the study partici- 
pants’ self-concept. For each of the 20 items on the test, the 
participants were asked to select from five symbols the one that 
completes a pattern consisting of between four and nine symbols. 
They were given 10 min to complete this task, and could score a 
maximum of 20 points. The internal consistency of the test items 
in the sample was α = .64.
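
For readers unfamiliar with the internal consistency coefficient reported here, the following is a minimal sketch of how Cronbach's alpha is computed from item-level scores. The function name and the toy 0/1 score matrix are hypothetical and serve only to illustrate the formula; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of participants, each a list of item scores."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical item scores (1 = correct) for five participants and four items.
toy_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why a short test with heterogeneous items can yield a modest value.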

Procedure. The test was conducted in groups of up to 50 
participants in an introductory psychology course. All experimen- 
tal sessions took place in the same lecture hall and at the same time 
of day. All participants in one session were either preservice 
teachers or students of psychology. After welcoming the partici- 
pants, the investigator explained that they were to take a multiple- 
choice test and that further details about the purpose of the study 
could be found in the standardized instructions. Each participant 
was then given an envelope containing the test materials. The 
materials were distributed in envelopes so that the investigator and 
his assistants could not see which participant was assigned to 
which test condition. In each session, half of the materials were
prepared for each of the two experimental conditions (stereotype
threat or no stereotype threat). The instructions found in the
envelope contained the experimental manipulation. The manipulation
for the stereotype threat condition stated:


Dear participant, 


On the following pages you will find a test of cognitive ability. This 
questionnaire is intended to test recent research findings showing a 
significant difference between pre-service teachers and students in
other fields (such as psychology, educational science,² etc.) in their
cognitive capabilities. We will be gathering a variety of cognitive data 
from you. Please concentrate carefully as you follow the instructions 
and complete these items to the best of your ability. 


The instruction for the no-stereotype threat condition merely 
stated that the questionnaire was intended to “test recent research 
findings” without giving any further details or mentioning any 
differences in the cognitive abilities of students from different 
fields of study: 


Dear participant, 


On the following pages you will find a test of cognitive ability. This 
questionnaire is intended to test recent research findings. We will be 
gathering a variety of cognitive data from you. Please concentrate 
carefully as you follow the instructions and complete these items to 
the best of your ability. 


Students in both disciplines (preservice teachers and psychol- 
ogy) within the experimental conditions received exactly the same 
instructions. After reading these general instructions, the partici- 
pants read the specific test instructions and then completed the test. 
In a final step, they were asked to provide demographic information
and to rate their agreement with the manipulation check item.
The experiment concluded with an extensive debriefing of the 
participants. 

Manipulation check. To assess the effectiveness of the ex- 
perimental manipulation, we asked participants about their beliefs 
concerning the performance differences between students from 
different fields of study: “Students of different fields of study will 
differ in their performance in this test.” Agreement with this item was
measured on a 6-point scale anchored at the endpoints by the 
phrases strongly disagree (1) and strongly agree (6). This method 
resembles the manipulation checks in other studies in which the 
participants were asked about the purpose of the experiment, 
details of the instructions, or their belief in the activated stereotype 
(e.g., Alter, Aronson, Darley, Rodriguez, & Ruble, 2010). 

Analysis. The manipulation check and the matrix test results 
were evaluated using a two-factor ANOVA (test condition: no-stereotype
threat condition = 0, stereotype threat condition = 1; field of study:
preservice teachers = 0, students of psychology = 1).³
Simple effect tests were used to analyze the results of the matrix test 
in greater detail. 


² The original German term applied in this instruction is Pädagogik. The
English translation of the term implies a slightly different meaning than the
original German term. Lehramt (the field of study for the preservice
teachers in Germany) and Pädagogik are not the same. Whereas the first
includes elements of the second, students of Pädagogik (in Germany) do
not become teachers but social workers or workers in other areas of
welfare.

³ Considering that a majority of the sample consisted of female students
who might also be subject to the low-competence and high-warmth
stereotype, we also tested for possible effects of gender. However, because we
found none, these analyses are not reported here.

Results


The ANOVA of the manipulation check showed a significant 
main effect for the stereotype threat factor, F(1, 252) = 4.22, p < 
.05, d = .29. Participants subject to the stereotype threat condition 
(M = 4.26, SD = 1.19) were more likely than members of the 
control group to believe that students in different fields would 
perform differently on the matrix test (M = 3.93, SD = 1.39). No 
significant effects were found for the field of study, F(1, 252) =
.01, ns, d < .10, or for the interaction of the factors, F(1, 252) =
.03, ns, d < .10. The manipulation can therefore be regarded as
successful. 

Table 3 shows the mean values and standard deviations of the 
participants’ test results (see also Figure 1). The ANOVA (two 
factors: stereotype threat and field of study) revealed a significant 
interaction, F(1, 258) = 6.72, p < .01, d = .35, and a significant 
main effect for the field of study, F(1, 258) = 10.10, p < .01, d =
.41. The mean value of the preservice teachers’ test results was
lower than that of the psychology students. The main effect of 
stereotype threat was not significant, F(1, 258) = 2.17, ns, d = .20. 
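
A rough plausibility check on the interaction term can be made from the means and standard deviations in Table 3 alone. The article does not report cell sizes; the sketch below assumes that each field was split evenly across conditions (67 + 67 preservice teachers, 64 + 64 psychology students), so it can only approximate the reported F(1, 258) = 6.72. The function and variable names are illustrative.

```python
# Cell summaries from Table 3: (mean, SD, n). The cell sizes are an ASSUMPTION
# (even split of n = 134 and n = 128); the article does not report them.
cells = {
    ("teacher", "threat"): (10.69, 2.74, 67),
    ("teacher", "no_threat"): (12.19, 3.88, 67),
    ("psychology", "threat"): (12.82, 2.61, 64),
    ("psychology", "no_threat"): (12.41, 2.55, 64),
}


def pooled_error_variance(cells):
    """Mean square error: within-cell variances pooled over all cells."""
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in cells.values())
    df_error = sum(n - 1 for _, _, n in cells.values())
    return ss_within / df_error, df_error


def contrast_f(cells, weights):
    """F(1, df_error) for a single-df contrast among cell means."""
    mse, df_error = pooled_error_variance(cells)
    estimate = sum(w * cells[key][0] for key, w in weights.items())
    scale = sum(w ** 2 / cells[key][2] for key, w in weights.items())
    return estimate ** 2 / (mse * scale), df_error


# Interaction: (teacher threat - teacher no threat)
#            - (psychology threat - psychology no threat)
interaction = {
    ("teacher", "threat"): 1, ("teacher", "no_threat"): -1,
    ("psychology", "threat"): -1, ("psychology", "no_threat"): 1,
}

f_interaction, df_error = contrast_f(cells, interaction)
```

Under these assumed cell sizes the interaction F comes out at roughly 6.6, in the same range as the reported value; the remaining gap is consistent with rounding in the table and the unknown exact cell sizes.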

In order to explain this interaction, post hoc simple effects 
analyses were carried out for each of the independent variables. 
Inspection by the least significant difference test revealed that 
preservice teachers who had been subjected to stereotype threat 
performed worse on the test relative to preservice teachers who 
had not been subjected to stereotype threat, F(1, 258) = 8.44, p < 
.01, d = .50. There was no significant difference between psy- 
chology students in the stereotype threat condition and psychology 
students in the no-stereotype threat condition, F(1, 258) = .62, ns, 
d = .14. We also compared preservice teachers and psychology 
students in the same test condition. Preservice teachers who had 
been subjected to stereotype threat performed worse on the test 
relative to psychology students who had been subjected to stereo- 
type threat, F(1, 258) = 17.00, p < .01, d = .71. There was no 
significant difference between preservice teachers in the no- 
stereotype threat condition and psychology students in the no- 
stereotype threat condition, F(1, 258) = .17, ns, d = .07. 
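
The d values reported for the simple effects can be related approximately to the F statistics: for a comparison of two groups with one numerator degree of freedom, F = t², and d = t · sqrt(1/n1 + 1/n2). The sketch below applies that conversion. The cell sizes are an assumption (an even split of each field across conditions, giving 67 and 64 per cell), because the article does not report them, and this is not necessarily how the authors computed d.

```python
from math import sqrt


def d_from_f(f_value, n1, n2):
    """Cohen's d for a two-group contrast with F(1, df) = t**2."""
    return sqrt(f_value * (1 / n1 + 1 / n2))


# Assumed cell sizes (not reported in the article).
N_TEACHER_CELL = 67
N_PSYCH_CELL = 64

# F values as reported for the four simple effects.
simple_effects = {
    "teachers: threat vs. no threat": (8.44, N_TEACHER_CELL, N_TEACHER_CELL),
    "psychology: threat vs. no threat": (0.62, N_PSYCH_CELL, N_PSYCH_CELL),
    "threat: teachers vs. psychology": (17.00, N_TEACHER_CELL, N_PSYCH_CELL),
    "no threat: teachers vs. psychology": (0.17, N_TEACHER_CELL, N_PSYCH_CELL),
}

d_values = {label: round(d_from_f(f, n1, n2), 2)
            for label, (f, n1, n2) in simple_effects.items()}
```

Under these assumptions the conversion gives approximately .50, .14, .72, and .07, close to the reported values of .50, .14, .71, and .07.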

In summary, preservice teachers who had been subjected to 
stereotype threat proved to be the only participants whose perfor- 
mance was affected by the experimental condition. The main effect 
of the field of study as identified by the ANOVA can therefore be 
attributed to the interaction between test condition and field of 
study. 


Discussion 


In Study 3, we examined whether the stereotype of lower 
cognitive capacity has a negative effect on preservice teachers’ 
intelligence test performance. The results confirm the typical find- 
ings of stereotype threat research: Members of a group associated 
with a negative stereotype (preservice teachers) performed less 


Table 3
Study 3: Means (and Standard Deviations) of Test Performance

Variable            Stereotype threat    No stereotype threat
Teacher training    10.69 (2.74)         12.19 (3.88)
Psychology          12.82 (2.61)         12.41 (2.55)


[Figure 1: mean of test performance (y-axis) by field of study (x-axis: teacher training, psychology), plotted separately for each test condition.]

Figure 1. Means of test performance in Study 3 for each field of study
and for each test condition (stereotype threat [ST]; no stereotype threat
[non-ST]). Error bars represent standard errors.


well on a test of cognitive ability when they were subject to 
stereotype threat than members of the same group who were not. 
Stereotype threat had no effect on the performance of members of 
a group that is not associated with that stereotype (psychology
students). 


General Discussion 


The studies described in this article expand on previous research 
on teacher education and the teaching profession. We have shown,
first of all, that the stereotype commonly found in the literature,
which views preservice teachers as less intellectually capable,
does indeed exist. Studies 1 and 2 demonstrated that the paternalistic
stereotype is applied to this group, suggesting less competence but
greater sociability. This is in keeping with the findings of Carlsson
and Björklund (2010) with regard to preschool teachers. Our
studies are the first to confirm that preservice teachers are subject 
to negative stereotypes of their competence; in other words, neg- 
ative stereotypes surrounding the teaching profession are present 
even during teacher training. We have also shown, for the first
time, that stereotype threat has a detrimental effect on the
performance of preservice teachers: In our study, the group of preservice
teachers that was subject to stereotype threat did worse on a test
of cognitive ability than the group that was not. This is,
accordingly, the first investigation to show evidence of weaker
performance by preservice teachers as soon as stereotype threat is
present. The practical implications of our results are discussed
below.

It is important to note that these results are of theoretical 
importance as well. Past research has focused on stereotype threat 
as it relates to groups in which membership is not chosen and 
cannot be changed (beyond certain limits), as in the cases of 
ethnicity and gender. None of the studies cited in the meta- 
analyses of Nguyen and Ryan (2008) or Walton and Cohen (2003), 
for example, look at groups that people choose to belong to. If
membership is voluntary, it is also generally possible to leave the 
group. As a result, people might regard their membership as less 
binding and less significant, and this might make them less vul- 
nerable to stereotype threat. However, our studies have shown that 
stereotype threat has an impact even when membership in a group 
is freely chosen. The decisive factor is not whether it is possible to 
leave the group, but whether a negative stereotype of the group 
exists. Future studies should look more closely at the role of group 
identification, for example, to find out whether stereotype threat 
leads to disidentification over the long term, or perhaps even to a
departure from the group. 

This is an important question for the practical realm as well (i.e., 
for teacher training). Because even a manipulation that did not 
explicitly mention the relevant stereotype was sufficient to activate 
stereotype threat, it appears that the salience of the stereotype, in 
itself, is enough to produce that result in everyday life. Particularly 
when preservice teachers attend the same courses as students in 
other fields, an implicit comparison may well lead them to expe- 
rience stereotype threat and thus to perform at a lower level. Over 
the long term, there is the risk of disidentification. Thus, would-be 
teachers might eventually withdraw from the academic arena or 
reject an identity that is vulnerable to stereotype threat, instead 
choosing a different field of study (see Woodcock et al., 2012, 
among others). 

The occurrence of stereotype threat effects on preservice teach- 
ers depends on several factors. The general salience of the negative 
stereotype may not be the same for all countries or cultures. 
Although we were able to show that there is a negative public 
perception of teachers and a negative performance-related stereo- 
type of preservice teachers in Germany, the same might not be true 
in other countries (Alexander, Chant, & Cox, 1994; Everton, 
Turner, & Hargreaves, 2007; Sahlberg, 2012; Verhoeven, Aelter- 
man, Rots, & Buvens, 2006). In countries or cultures where 
teaching is a profession of high esteem, teacher training institu- 
tions might be able to recruit the very best candidates for teacher 
training. Under such circumstances, general salience of a negative 
performance-related stereotype of preservice teachers is unlikely 
to exist. 

Differences between educational systems for teacher training 
may also influence the occurrence of stereotype threat. In Ger- 
many, preservice teachers spend 5 years at the university before 
they leave the university and start their on-the-job training in 
schools. During their time at the university, they attend the same 
courses as students not training to be teachers (e.g., preservice 
mathematics teachers attend together with students who want to 
become mathematicians). Here, the preservice teachers face the 
implicit or explicit comparisons that might lead them to experience 
stereotype threat. However, during their on-the-job training, pre- 
service teachers are probably safe from stereotype threat for they 
no longer face comparisons with students from other fields of 
study and are exposed to positive role models (professional teach- 
ers; on the effect of positive role models, see, e.g., Huguet & 
Régner, 2007). Therefore, with regard to stereotype threat, preser- 
vice teachers might benefit from training systems that separate 
them and other students. Future research may address this point by 
comparing preservice teachers’ vulnerability to stereotype threat 
(Barnard, Burley, Olivarez, & Crooks, 2008) in different stages of 
their studies. Additionally, comparisons of stereotype threat effects 


on preservice teachers in different teacher training systems might 
also prove insightful. 

Differences within the group of preservice teachers may also be 
important. We did not investigate this question, owing to the 
exploratory nature of our studies. It should be noted, however, that 
preservice teachers differ a great deal with respect to their
motivation and cognitive capabilities. Retelsdorf and Möller (2012)
investigated motivational predictors of students from different 
teacher education programs in Germany. They found that subject 
interest was strongly related to choosing an academic track, 
whereas educational interest was rather related to the choice of an 
elementary or nonacademic track program. Kaub and colleagues 
(2012) found that the profile of those planning to teach science 
showed a lower level of interest and satisfaction but that their 
cognitive capacities were superior to those of students planning to 
teach other subjects. It is also possible that there are different 
stereotypes of preservice teachers, depending on their specific 
subject areas. Indeed, research has shown that there are substereo- 
types within other stereotyped groups (for more on the group of 
African Americans, for example, see Walzer & Czopp, 2011). 
Future research on stereotype threat as it relates to preservice 
teachers should also seek to identify possible substereotypes, as 
this would paint a more complete picture of how stereotype threat 
affects this group. 

There remain two open questions, which are to be considered in 
future research. First, Study 3 was not designed to reveal the 
mechanism of the stereotype threat effect on preservice teachers. 
Their poor performance might have been a result of cognitive 
effects such as processes associated with working memory (e.g., 
Schmader et al., 2008) or motivational effects such as performance 
avoidance (e.g., Thoman, Smith, Brown, Chase, & Lee, 2013). 
Considering that any measures taken against stereotype threat 
effects on preservice teachers will rely on knowledge of their exact 
mechanism, future research will have to elaborate on that topic. 
Second, we were not able to consider moderating variables, a 
well-known factor in stereotype threat research (see, e.g., Martiny 
& Götz, 2011). Typical examples, which future studies should take
into account, include identification with the domain and the ste- 
reotyped group. As various researchers have shown (e.g., Keller, 
2007; Osborne & Walker, 2006), a high level of identification with 
a stereotyped domain increases the impact of stereotype threat— 
although these effects are not entirely clear for all groups or all 
areas of competence (Nguyen & Ryan, 2008). Paradoxically, pre- 
service teachers who are engaged and identify closely with their 
field of study might be particularly vulnerable to stereotype threat. 
The same holds true for identification with the stereotyped group; 
as Armenta (2010), among others, has shown, this too increases the 
effect of stereotype threat. One future hypothesis might be that the 
more someone identifies with becoming a teacher, the greater the 
potential impact of stereotype threat. Furthermore, preservice 
teachers’ identification with either the group of preservice teachers 
or their fields of study may depend on their reasons for choosing 
this profession. This is particularly interesting because the preser- 
vice teacher students choose to self-select into a negatively ste- 
reotyped profession. It goes beyond the purpose of our study to 
analyze the motivation for becoming a teacher. However, we know 
from other research that prospective teachers are strongly
motivated to work with children (Paulick, Retelsdorf, & Möller, 2013;
Pohlmann & Möller, 2010; Retelsdorf, Bauer, Gebauer, Kauper, &
Möller, in press; Watt et al., 2012) while also being interested in
several practical aspects of the teaching profession (e.g., holidays
and the good pay [in Germany]). Students joining a teacher training
program for intrinsic reasons (e.g., subject interest, educational
interest, etc.) might identify more strongly with their group or their field
of study than students joining for extrinsic reasons (e.g., job 
security, vacation, etc.). Therefore, students with the more ade- 
quate motivation might be the ones most endangered by stereotype 
threat. 

As a final note, we make two methodological observations regarding
the realization of any future research. First, future research should 
make use of more reliable performance measures. The test applied 
in Study 3 offered only a low (yet still acceptable) internal con- 
sistency. The internal consistency found in our study resembles the 
one offered by the test manual (α = .66). The mean score for the
participants unaffected by stereotype threat is slightly higher than 
the standard value specified for participants of that age and edu- 
cational background, whereas their average standard deviation 
equals the one given in the test manual. As explained above, the 
test was chosen for the sake of content, and although a low alpha 
does not necessarily render a test score useless or impossible to 
interpret (Carmines & Zeller, 1979; Cronbach, 1951; Schmitt, 
1996), the necessity for a more reliable instrument remains. Sec- 
ond, future research should consider a possible mix-up of stereo- 
types about different social groups that are represented in the 
sample. As we pointed out in Study 3, other social groups might 
share a pattern of stereotypes similar to the one encountered for 
preservice teachers (or any other group that is subject to research, 
for that matter). Therefore, these social groups might be affected 
by a stereotype threat manipulation that does not target them. This 
problem can be dealt with by either excluding possibly problem- 
atic groups or, because this is not always possible, by testing for 
the effects of other group memberships. 

These methodological considerations notwithstanding, our studies
paint a consistent picture that has clear implications for teacher 
training. Preservice teachers are viewed as less competent, and this 
stereotype, when made salient, has a negative impact on their 
performance. It would therefore be wise to take appropriate mea- 
sures to counteract that stereotype during teacher training, partic- 
ularly when students in this group share courses with students in 
other fields. As a possible first step, instructors and students might 
be informed of the fact that research has shown that—in contrast 
to the stereotype—preservice teachers are not, in fact, less com- 
petent than other students, but perform at a similar level. 

This research examined whether the benefits of parents’ involvement in children’s learning are due in part
to value development among children. Four times over the 7th and 8th grades, 825 American and Chinese
children (M age = 12.73 years) reported on their parents’ involvement in their learning and their
perceptions of the value their parents place on school achievement as well as the value they themselves 
place on it. Children’s academic functioning was assessed via children’s reports and school records. 
Value development partially explained the effects of parents’ involvement on children’s academic 
functioning in the United States and China. For example, the more children reported their parents as 
involved, the more they perceived them as placing value on achievement 6 months later; such perceptions 
in turn predicted the subsequent value children placed on achievement, which foreshadowed enhanced 

A wealth of research supports the idea that parents’ involvement
in children’s learning enhances children’s academic functioning 
(for reviews, see Grolnick, Friendly, & Bellas, 2009; Pomerantz, 
Kim, & Cheung, 2012): Children whose parents are involved on 
the school (e.g., attending parent-teacher conferences) and home 
(e.g., discussing school with children) fronts often exhibit en- 
hanced engagement (e.g., use of self-regulated strategies), skills 
(e.g., phonological awareness), and achievement (e.g., grades). 
Notably, parents’ involvement plays a role in children’s academic 
functioning even when aspects of children’s home environment 
such as parents’ income and education are taken into account (e.g., 
Dearing, Kreider, Simpkins, & Weiss, 2006; Jeynes, 2005, 2007).
The effects of parents’ involvement are also not accounted for by 
other dimensions of parenting such as supporting children’s au- 
tonomy (e.g., C. S. Cheung & Pomerantz, 2011; Deslandes, 
Bouchard, & St.-Amant, 1998). 

Research focusing on why parents’ involvement in children’s 
learning benefits children’s academic functioning identifies the 
development of children’s actual and perceived competencies as
important (e.g., Dearing et al., 2006; Sénéchal & LeFevre, 2002).
However, it has also been argued that parents’ involvement leads 
children to view doing well in school as valuable, which fosters 
children’s engagement in school, enhancing their achievement 
(e.g., Epstein, 1988; Grolnick & Slowiaczek, 1994). Unfortu- 
nately, such a value development model has not been tested. The 
goal of the current research was to address this gap by evaluating 
whether the effect of parents’ involvement on children’s engage- 
ment and grades in school is due in part to the development of 
children’s values in regard to school achievement. Drawing on 
prior theory and research on value transmission (i.e., children’s 
adoption of parents’ values; e.g., Grusec & Goodnow, 1994; Knafo 
& Schwartz, 2009) as well as parents’ involvement in children’s 
learning (e.g., Hill & Tyson, 2009), we hypothesized two pathways 
by which parents’ involvement facilitates children valuing 
achievement in school. 


The Perception-Acceptance Value Development 
Pathway 


Grusec and Goodnow (1994) proposed a two-step process 
model by which parents transmit their values to children. First, 
children must be aware of parents’ values such that they perceive 
them accurately. Second, children must accept parents’ values as 
their own. Both steps are considered key in effective transmission 
of values from generation to generation (e.g., Barni, Ranieri, 
Scabini, & Rosnati, 2011; Knafo & Schwartz, 2009). Grusec and 
Goodnow focused on how the type of discipline parents use with 
children contributes to value transmission by shaping children’s 
perceptions of parents’ values. Several other dimensions of inter- 
actions between parents and children, such as parents’ discussion 
of their values with children (e.g., Knafo & Schwartz, 2004; 
Okagaki & Bevis, 1999) and the quality of children’s relationships
with parents (Barni et al., 2011), have also received attention. As 
a commitment of resources (e.g., time, energy, and financial pro- 
visions) to children in the academic arena (Grolnick & Slowiaczek, 
1994), parents’ involvement in children’s learning may be a key 
mechanism by which parents convey to children that they view 
school as important. When parents take the time and trouble to 
participate in school events, children may view parents as placing 
importance on learning. Parents’ involvement on the home front 
may have similar consequences—for example, when parents ask 
children about what they are learning in school or provide children 
with learning resources (e.g., books), they may communicate that 
they see doing well in school as useful. 

When children see parents as valuing achievement in school, 
they may come to value it themselves (e.g., Eccles et al., 1983; 
Grolnick, Ryan, & Deci, 1997). Grusec and Goodnow (1994) 
argued that once children are aware of the values parents hold, their
acceptance of such values as their own is facilitated in the context 
of a warm relationship with parents (see also Barni et al., 2011). 
The commitment of resources characteristic of parents’ involve- 
ment may signal to children that parents care about them. More- 
over, in the context of their involvement, parents may provide 
emotional support for children (e.g., by reacting to children’s 
frustration with homework with soothing words), thereby creating 
a sense of trust in children that may facilitate their adoption of 
parents’ values (e.g., C. S. Cheung & Pomerantz, 2012; Grolnick 
& Slowiaczek, 1994; Grusec, 2002). Parents’ involvement in chil- 
dren’s learning may be a particularly unique dimension of parent- 
ing in that it simultaneously communicates the value parents place 
on doing well in school (Step 1 of Grusec and Goodnow’s model), 
while also leading children to take on this value as their own (Step 
2 of Grusec and Goodnow’s model). Thus, parents’ involvement 
may enhance children’s achievement via a perception-acceptance 
pathway: Parents’ involvement leads children to perceive parents 
as valuing school achievement (path a in Figure 1), thereby height- 
ening the value children themselves place on it (path b in Figure 1). 


The Experience Value Development Pathway 


The perception-acceptance pathway may be accompanied by 
what we label an experience pathway that directly fosters the value 
children place on achievement in school. Although parents’ involvement
in children’s learning likely conveys the value parents
place on children’s school endeavors, it may not always lead 
children to value school achievement via children’s awareness of 
parents’ values (Eccles et al., 1983). When parents become in- 
volved in children’s learning, they may create experiences for 
children that directly heighten the value children place on school
achievement. For example, when parents discuss school with children,
children may generate reasons for its utility, leading them to
see doing well in school as valuable (Hill & Tyson, 2009). In a 
somewhat different vein, drawing from Bem’s (1967, 1972) Self- 
Perception Theory, practices such as helping children to sustain 
their effort on their homework until it is finished may lead children 
to conclude that they value doing well in school given how much 
time they invest in it. In the experience pathway, parents’ involve- 
ment creates experiences that lead children to place value on 
school achievement (path c in Figure 1), regardless of their per- 
ceptions of parents’ values. 


The Role of Values in Academic Functioning 


Whether the value children place on achievement in school 
ensuing from parents’ involvement develops via a perception- 
acceptance pathway or an experience pathway, prior theory and 
research (e.g., Eccles et al., 1983; Wang & Pomerantz, 2009) 
indicate that it supports children’s academic functioning (see path d in
Figure 1). In their Expectancy-Value Theory, Eccles et al. (1983) 
made the case that when children value achievement in school, 
they become more engaged in school, which enhances their 
achievement. Indeed, the more children view doing well in school 
as important, the more engaged they are—for example, they use 
heightened self-regulated learning strategies, such as monitoring 
and planning their learning (e.g., Pintrich, 1999; Wang & Pomer- 
antz, 2009). Notably, heightened value as well as engagement 
predicts improved achievement among children over time (e.g., 
Alexander, Entwisle, & Dauber, 1993; Kenney-Benson, Pomer- 
antz, Ryan, & Patrick, 2006; Wang & Pomerantz, 2009). 


Value Development Pathways in the 
United States and China 


Over the last several years, there has been a call to extend the 
understanding of psychological processes beyond Western populations
(e.g., Arnett, 2008; Henrich, Heine, & Norenzayan, 2010a,
2010b). In the case of parents’ involvement in children’s learning, 
this may be of import when it comes to China because Chinese 
parents are involved differently in children’s learning than are their 
American counterparts (for a review, see Pomerantz, Ng, Cheung, 
& Qu, in press). For one, Chinese parents are more
involved than American parents (e.g., Chen & Stevenson,
1989; Ng, Pomerantz, & Lam, 2007). Consequently, both the 
perception-acceptance and experience value development pathways
may be stronger in China than in the United States, as such
heightened involvement may convey more clearly that parents 
value school achievement and create more experiences that directly
heighten the value children place on school achievement.
Moreover, the amplified commitment of resources reflected in
parents’ involvement may enhance children’s adoption of parents’
values.

Figure 1. Hypothesized value development pathways underlying the effect of parental involvement on
children’s academic functioning. The perception-acceptance pathway is reflected in paths a, b, and d; the
experience pathway is reflected in paths c and d.

Chinese parents’ involvement in children’s learning, however, is 
more controlling than that of American parents with greater atten- 
tion to children’s mistakes (e.g., C. S. Cheung & Pomerantz, 2011; 
Ng et al., 2007). This, along with the tendency for Chinese (vs.
American) children to feel less close to parents during adolescence
(e.g., Pomerantz, Qin, Wang, & Chen, 2009), may undermine value
transmission. Although it is unclear whether parents’ involvement
similarly fosters value development in China and the United States, prior
examination of the effects of parents’ involvement on children’s
engagement and grades yielded similar effects in the two countries
(C. S. Cheung & Pomerantz, 2011).


Overview of the Current Research 


To examine whether parents’ involvement in children’s learning 
enhances children’s academic functioning by heightening the 
value children place on school achievement in the United States 
and China, the current research evaluated the hypothesis that two 
value-development pathways underlie the benefits of parents’ in- 
volvement (see Figure 1). In the perception-acceptance pathway 
(paths a, b, and d), parents’ involvement signals to children that 
parents value school achievement, leading children to value it, 
which in turn enhances children’s achievement. In the experience
pathway (paths c and d), parents’ involvement develops the value
children place on school achievement not through the messages it 
conveys about parents’ values, but rather directly through the 
experiences it creates. Comparisons between the United States and 
China for both pathways were made to evaluate their generaliz- 
ability. 

In testing the value transmission pathways, we focused on 
children in the middle school years because parents’ involvement 
may offset the devaluing of school that often occurs among chil- 
dren during this phase of development (for a review, see Wigfield 
& Wagner, 2005). Children in the United States and China re- 
ported four times over the seventh and eighth grades on parents’ 
involvement in their learning, their perceptions of the value parents 
place on school achievement, and the value they themselves place 
on it. Children’s academic functioning was assessed with chil- 
dren’s reports and school records. The four-wave design allowed 
for the examination of the sequence of effects posited in Figure 1. 
Because each construct was assessed at each wave, autoregressive 
effects could be taken into account (see Figure 2), which permitted 
identification of the direction of effects. 

We investigated two dimensions of children’s academic func- 
tioning that have important implications for children’s lives. First, 
children’s engagement in school is not only predictive of their 
achievement over time (e.g., M.-T. Wang & Fredricks, 2014; Q. 
Wang & Pomerantz, 2009) but also appears to protect children 
against internalizing and externalizing problems (e.g., M.-T. Wang
& Fredricks, 2014; M.-T. Wang & Peck, 2013). Children reported
on two forms of their engagement: their use of self-regulated
learning strategies and the time they spend on schoolwork outside
of school. Second, children’s grades in school are a significant
reflection of their achievement (Duckworth & Seligman, 2005;
Grolnick et al., 1997) with implications for subsequent opportunities
(e.g., placement in enrichment activities) as well as success
later in life (e.g., Geiser & Santelices, 2007). There is sizeable
evidence documenting the importance of parents’ involvement in
children’s learning for both children’s engagement and grades
(e.g., C. S. Cheung & Pomerantz, 2011; Grolnick & Slowiaczek,
1994).

Figure 2. Value development pathways underlying the effect of parents’ involvement on children’s grades. For
child gender, 1 = boys, 2 = girls; for child residence with parents, 1 = not residing with both parents, 2 =
residing with both parents. Letters (i.e., a, b, c1, c2, and d) represent links comprising the two value development
pathways denoted in Figure 1. For ease of presentation, within-wave covariances are not shown. Based on the
chi-square difference tests, all paths comprising the indirect pathways were constrained to be equal between the
United States and China. American standardized estimates are above; Chinese standardized estimates are below.
Solid lines are significant (p < .05); dashed lines are not. * p < .05. ** p < .01. *** p < .001.

With the exception of grades, children provided reports for all 
the constructs under study. This is of particular concern when it 
comes to parents’ involvement in children’s learning. Children’s 
reports of such involvement are only modestly associated with 
teachers’ and parents’ reports (e.g., Bakker, Denessen, & Brus-Laeven,
2007; Hill et al., 2004; Reynolds, 1992). However, because
children’s, teachers’, and parents’ reports of parents’ involvement
each predict unique variance in children’s achievement, it has
been argued that each captures unique aspects of parents’ involvement
(Reynolds, 1992). Children’s reports reflect their perceptions
of parents’ involvement. This is significant because children must 
notice parents’ involvement to draw conclusions about parents’ 
values (C. S. Cheung & Pomerantz, 2012; Grolnick & Slowiaczek, 
1994). However, each reporter may also bring a unique set of 
biases to their reports. In the current context, the value children 
view parents placing on school achievement or that they them- 
selves place on it may bias their reports, such that effects reflect 
children’s perceptions of parents’ values or their own values rather 
than parents’ involvement. To rule out this possibility, we tested 
alternative pathways—for example, the value children place on 
school achievement predicts their reports of parents’ involvement 
over time. 


Method 


Participants 


The University of Illinois U.S.-China Adolescence Study began 
when children entered a new school in seventh grade and con- 
cluded at the end of eighth grade in the United States and China 
(e.g., Pomerantz et al., 2009; Wang & Pomerantz, 2009). Partici- 
pants were 374 American children (187 boys; M age = 12.78 years 
in the fall of seventh grade) and 451 Chinese children (240 boys; 
M age = 12.69 years in the fall of seventh grade). In each country, 
children attended public school in primarily working- or middle- 
class areas. The American children attended one of two public 
schools consisting of the seventh and eighth grades in the suburbs 
of Chicago. At the time of the research, Chicago was a city with high
population density (12,750 people per square mile) and a median
yearly family gross income of $61,182; 30% of the population over
the age of 25 possessed at least a college degree (U.S. Census Bureau,
2007). The median family income of the two selected suburbs was
$60,057 and $72,947, with 21% and 26% of the population over 
the age of 25 possessing a college degree. Reflecting the ethnic 
composition of these areas, participants were predominantly European
American (88%) with 9% Hispanic American, 2% African
American, and 1% Asian American. Seventy-nine percent of participating
children reported living with two parents.

The Chinese children attended one of two public schools in the 
suburbs of Beijing; one school consisted of the seventh to ninth 
grades and the other of the seventh to 12th grades. According to
the Beijing Municipal Bureau of Statistics (2005), Beijing was, at the
time of the research, a densely populated city (13,386 people per square
mile) with an annual discretionary income per capita of
$15,638 RMB; 13% of the population over the age of 6 had at least
a college degree. In the two selected suburbs, 9% and 28% of the population
over the age of 6 had a college degree. Over 95% of the
residents in these areas were of the Han ethnicity (Beijing Munic- 
ipal Bureau of Statistics, 2005), which is slightly above the 92% 
for the country as a whole (China Population and Development 
Research Center, 2001). Eighty-six percent of the participating 
children reported living with two parents. An opt-in consent pro- 
cedure was used in which parents provided permission for children 
to participate. Sixty-four percent of parents in the United States 
and 59% of parents in China allowed their children to participate. 


Procedure 


Children completed a set of questionnaires during two 45-min 
sessions at four times approximately 6 months apart: fall of sev- 
enth grade (Wave 1), spring of seventh grade (Wave 2), fall of 
eighth grade (Wave 3), and spring of eighth grade (Wave 4). 
Instructions and items were read aloud to children in their native 
language in the classroom during regular class time by trained 
native research staff. Children received a small gift (e.g., a calcu- 
lator) as a token of appreciation at the end of each session. The 
average attrition rate over the entire study was 4% (2% in the 
United States and 6% in China). More than 85% of the children 
had data at all four waves of the study for all of the analyses, with 
more than 98% having data at two or more waves for all of the 
analyses. At Wave 1, children with complete data differed from 
those without complete data only in that their grades were better, 
t(818) = 2.01, p < .05. The Institutional Review Boards of the
University of Illinois and Beijing Normal University approved the 
procedures. 


Measures 


The measures were originally written in English. Standard trans- 
lation and back-translation procedures (Brislin, 1980) were em- 
ployed with repeated discussion among American and Chinese 
members of the research team to modify the wording of the items 
to ensure equivalence in meaning between the English and Chinese 
versions (Erkut, 2010). Equivalence was also established statisti- 
cally. A series of confirmatory factor analyses (CFAs) was con- 
ducted in the context of two-group nested structural equation 
modeling (SEM) to examine the metric invariance of the measures 
between the United States and China over the four waves of the 
study; metric invariance is essential and sufficient in making valid 
comparisons of the associations (e.g., Little, 1997), as was done in 
the current research (see below). 

In each set of CFAs, an unconstrained model was compared to 
a constrained (i.e., metric invariance) model. The unconstrained 
models consisted of the same latent construct repeatedly assessed
over the four waves, yielding a total of four latent constructs. These
constructs were allowed to correlate with one another; errors of the 
same indicators over time were also allowed to correlate when 
suggested by modification indexes from the CFAs conducted on 
the sample with no missing data (Keith, 2006; McDonald & Ho, 
2002). The parameters in the unconstrained models were freely 
estimated without any between-country or across-time equality 
constraints. In the constrained models, the factor loadings of the 
same indicators were forced to be equal between the two countries 
and across the four waves. Monte Carlo studies indicate that a 
decrease from the unconstrained to the corresponding constrained 
model in the comparative fit index (CFI) of no more than .01, 
supplemented by an increase in the root-mean-square error of 
approximation (RMSEA) of no more than .015, is reflective of 
invariance (Chen, 2007). Although chi-square difference tests are 
considered appropriate for hypothesis testing purposes, the current 
consensus is that they are not appropriate for evaluating measure- 
ment invariance (e.g., Chen, 2007; G. W. Cheung & Rensvold, 
2002; Little, 1997). 

Prior analyses on these data, using two parcels of items (or two 
items in the case of time spent on homework outside of school) to 
represent each latent construct, indicated that the measures of
parents’ involvement in children’s learning, the value children 
place on school, and children’s engagement have metric invariance 
between countries and over time (C. S. Cheung & Pomerantz, 
2011; Wang & Pomerantz, 2009). The use of parcels allowed us to 
build parsimonious models based on solid and meaningful indica- 
tors, enhancing the likelihood of replication in future research 
(Little, Cunningham, Shahar, & Widaman, 2002; Little, 
Rhemtulla, Gibson, & Schoemann, 2013). Parsimony was of par- 
ticular concern in the current research given the sizeable number of 
items comprising each scale and the complexity of the models, 
which can strain the number of free parameters that can be esti- 
mated (e.g., Kline, 1998), despite our sample size of 825. In such 
a case, the use of parcels is desirable (Little et al., 2013). Impor- 
tantly, principal components analysis (PCA) on each set of items 
comprising each parcel indicated that each set formed a single 
factor; the parcels were also each internally reliable on their own 
(αs = .73 to .88).

Metric invariance of children’s perceptions of the value parents 
place on school achievement has not been evaluated in prior 
analyses; thus, it was tested for the current research. The latent 
construct was represented by two parcels of items: Items about the 
importance of doing well were aggregated in one parcel, which 
PCA indicated formed a single factor (αs = .80 to .93; see item
descriptions below), and items about the importance of not doing
poorly were aggregated in another, which PCA indicated formed a
single factor as well (αs = .84 to .92). Both the unconstrained,
χ²(df = 9) = 26.10, CFI = .95, Tucker-Lewis index (TLI) = .92,
RMSEA = .08, and constrained, χ²(df = 13) = 38.12, CFI = .95,
TLI = .92, RMSEA = .07, models fit the data adequately, with
differences between the CFIs and RMSEAs of no more than .01.
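
The invariance criterion (a decrease in CFI of no more than .01 and an increase in RMSEA of no more than .015; Chen, 2007) can be illustrated with a minimal sketch applied to the fit indexes reported above. The function name and its parameters are hypothetical and serve only as an illustration; this is not the authors' analysis code.

```python
def metric_invariance_supported(cfi_free, cfi_constrained,
                                rmsea_free, rmsea_constrained,
                                max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Return True when the constrained model fits about as well as the free one."""
    tol = 1e-9  # guards against floating-point noise right at the cutoffs
    cfi_drop = cfi_free - cfi_constrained        # positive when fit worsens
    rmsea_rise = rmsea_constrained - rmsea_free  # positive when fit worsens
    return cfi_drop <= max_cfi_drop + tol and rmsea_rise <= max_rmsea_rise + tol


# Fit indexes reported for children's perceptions of parental value
print(metric_invariance_supported(0.95, 0.95, 0.08, 0.07))  # True
```

Here the CFI does not change and the RMSEA decreases slightly under the constraints, so both conditions are met.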

Parental involvement in child learning. Parents’ involve- 
ment in children’s learning was assessed with 10 items (e.g., “My 
parents help me with my homework when I ask.” “My parents try 
to get to know the teachers at my school.” “My parents purchase 
extra workbooks or outside materials related to school for me.”) 
adapted from prior research (Chao, 2000; Kerr & Stattin, 2000; 
Kohl, Lengua, McMahon, & The Conduct Problems Prevention
Research Group, 2000; Stattin & Kerr, 2000). In line with Grolnick
and Slowiaczek’s (1994) definition of parents’ involvement, the 
items characterize a variety of practices (e.g., attendance of parent- 
teacher conferences, discussion of school with children, and assis- 
tance with homework) reflecting parents’ commitment of re- 
sources to children in the academic arena. Children indicated the 
extent to which each of the statements was true (1 = not at all true, 
5 = very true). The 10 items were combined, with higher numbers 
reflecting greater involvement as reported by children (αs = .83 to
.85 in the United States and .77 to .83 in China). 
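
The internal consistency values reported for each scale are presumably Cronbach's alphas. A minimal sketch of that computation follows; the function name and the toy responses are illustrative assumptions, not data from the study.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Variance of each item across respondents
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed scale score across respondents
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data: three respondents answering three items identically within person
print(cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
```

When items rise and fall together across respondents, alpha approaches 1; when they are unrelated, it approaches 0.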

Child perceptions of parental value. To assess children’s 
perceptions of the value their parents place on school achievement, 
children indicated how important (1 = not at all important, 7 = 
very important) it is to parents that they do well (e.g., “How 
important is it to your parents that you do well in language arts?”)
and avoid doing poorly (e.g., “How important is it to your parents 
that you avoid doing poorly in math?”) on four core subjects 
(language arts, math, science, and social studies in the United 
States; language arts, math, biology, and English in China) for 
which children received grades. The eight items were combined, 
with higher numbers reflecting perceptions of greater parental 
value (αs = .93 to .96 in the United States and .87 to .91 in China).

Child value. The value children themselves place on school 
achievement was assessed with a modified version of Pomerantz, 
Saxon, and Oishi’s (2000) measure. Paralleling the measure of 
children’s perceptions of the value parents place on school 
achievement, for each of the four core subjects, children indicated 
how important (1 = not at all important, 7 = very important) it 
was for them to do well (e.g., “How important is it to you to do 
well in math?”) and avoid doing poorly (e.g., “How important is it 
to you to avoid doing poorly in language arts?”). The eight items 
were combined, with higher numbers reflecting greater value 
(αs = .91 to .94 in the United States and .88 to .91 in China).

Child engagement. Two forms of children’s engagement in 
school were assessed. The 30-item Dowson and McInerney (2004) 
Goal Orientation and Learning Strategies Survey assessed chil- 
dren’s use of self-regulated learning strategies. Three scales assess 
children’s metacognitive strategies: Six items assess monitoring 
(e.g., “I check to see if I understand the things I am trying to 
learn”), six assess planning (e.g., “I try to plan out my schoolwork 
as best as I can”), and six assess regulating (e.g., “If I get confused 
about something at school, I go back and try to figure it out”). Two 
scales assess children’s cognitive strategies: Six items assess re- 
hearsal (e.g., “When I want to learn things for school, I practice 
repeating them to myself”) and six assess elaboration (e.g., “I try 
to understand how the things I learn in school fit together with 
each other”). Children indicated the extent to which each of the 30
statements was true of them (1 = not at all true, 5 = very true). 
The metacognitive and cognitive strategies scales were combined, 
with higher numbers representing greater school engagement 
(αs = .96 to .97 in the United States and .93 to .96 in China).

The time children spend on schoolwork outside of school was 
assessed with a modified version of the scale used by Fuligni, 
Tseng, and Lam (1999). Children indicated how much time they 
spend on their schoolwork outside of school on a typical weekday 
and weekend (1 = less than 1 hr, 6 = more than 5 hr). Their
responses for a typical weekday were weighted by five and com- 
bined with those of each day for a typical weekend weighted by 
two. Higher numbers reflect more time spent on schoolwork outside
of school (rs = .48 to .64 in the United States and .41 to .52
in China).
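
The weighting of the weekday and weekend ratings can be sketched as follows. The text does not state whether the weighted ratings were summed or averaged; the sketch assumes a weighted mean, which keeps the composite on the original 1 to 6 scale and orders children identically to a weighted sum. The function name is hypothetical.

```python
def weekly_homework_score(weekday_rating, weekend_rating):
    """Weighted mean of two 1-6 ratings: 5 weekdays and 2 weekend days."""
    for rating in (weekday_rating, weekend_rating):
        if not 1 <= rating <= 6:
            raise ValueError("ratings are expected on the 1-6 response scale")
    # Assumption: the weighted ratings are averaged over the 7 days of the week
    return (5 * weekday_rating + 2 * weekend_rating) / 7


print(weekly_homework_score(3, 5))  # (15 + 10) / 7
```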

Child grades. Children’s grades in the four core subjects were 
obtained from schools. Grades in the American schools were 
originally in letters and were converted to numbers. Because there 
were 13 steps in the ladder of grades used in the American schools, 
grades were converted to numbers with a range of 0 (i.e., a grade 
of F) to 12 (i.e., a grade of A+) with a 1-point increment between 
each step in the grades (e.g., B− = 7, B = 8, B+ = 9, A− = 10).
Such conversion has been used in prior research (e.g., Coe, Piv- 
arnik, Womack, Reeves, & Malina, 2006; Schwartz, Kelly, & 
Duong, 2013; Wood & Locke, 1987). Moreover, simulation re- 
search indicates that the treatment of discrete categories as con- 
tinuous is unlikely to result in biased parameter estimates when the 
number of categories is more than six as is the case in the current 
research (Rhemtulla, Brosseau-Liard, & Savalei, 2012). In the 
Chinese schools, grades were originally numerical, ranging from 0 
to 100 in one school and from 0 to 120 in the other. In both 
countries, grades were standardized within school to take into 
account differences in the grading systems of the schools. The four 
subjects were combined, with higher numbers reflecting better 
grades. 
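
The grade conversion and within-school standardization can be sketched as follows. The text reports only the endpoints (F = 0, A+ = 12) and the B− to A− values; the full 13-step ladder below, the use of the population standard deviation, and all names are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Assumed 13-step ladder, consistent with F = 0, B- = 7, A- = 10, A+ = 12
LETTERS = ["F", "D-", "D", "D+", "C-", "C", "C+",
           "B-", "B", "B+", "A-", "A", "A+"]
GRADE_POINTS = {letter: points for points, letter in enumerate(LETTERS)}


def standardize_within_school(records):
    """records: list of (school_id, numeric_grade); returns z-scores in input order.

    Assumes each school has at least two distinct grades (nonzero SD).
    """
    by_school = {}
    for school, grade in records:
        by_school.setdefault(school, []).append(grade)
    stats = {s: (mean(g), pstdev(g)) for s, g in by_school.items()}
    return [(grade - stats[school][0]) / stats[school][1]
            for school, grade in records]


# Hypothetical schools on different grading scales
records = [("s1", 80), ("s1", 100), ("s2", 90), ("s2", 110)]
print(standardize_within_school(records))
```

After standardization, each school has a mean of zero and a standard deviation of one, so grades from schools with different scales can be pooled.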


Results 


Overall, the measures in the current research were approxi- 
mately normally distributed. In both the United States and China 
across the four waves of assessment, the indexes for skewness and 
kurtosis were less than 1, with only one exception: the index for
skewness was 1.47 and the kurtosis index was 2.28 for the
avoidant dimension of the value children place on school achieve- 
ment at Wave 1 in the United States. Hence, across the six 
measures at each of the four waves there was no indication of 
serious violation of the normality assumption. 
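
A screening of this kind can be sketched as follows. The exact skewness and kurtosis estimators used by the authors are not stated; the sketch assumes simple moment-based indexes (with excess kurtosis, so that a normal distribution scores zero), and the function names are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    m4 = sum((v - center) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def within_cutoff(skew, kurt, cutoff=1.0):
    """True when both indexes are smaller than the cutoff in absolute value."""
    return abs(skew) < cutoff and abs(kurt) < cutoff


# The one exception reported above (skewness 1.47, kurtosis 2.28)
print(within_cutoff(1.47, 2.28))  # False
```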

As shown in Table 1, in both the United States and China, 
parents’ involvement in children’s learning—as reported by chil- 
dren—was positively associated with children’s perceptions of the 
value parents place on school achievement (rs = .25 to .43, ps < 
.001) as well as the value children themselves place on it (rs = .25 
to .39, ps < .001) at each wave. Children’s perceptions of the 
value parents place on school achievement were positively asso- 
ciated at each wave with the value children place on it (rs = .28 
to .55, ps < .001). The value children place on school achievement 
was also associated with their engagement (rs = .30 to .58 for 
self-regulated learning strategies and .13 to .21 for time spent on 
schoolwork outside of school; ps < .05) as well as grades
(rs = .22 to .38, ps < .001) at each wave in the United States and 
China. Although such associations are suggestive of the viability 
of both the perception-acceptance and experience value develop- 
ment pathways, they do not provide insight into the direction of 
effects. Evaluation of the direction of effects requires analyses 
accounting for the autoregressive effects. 

Note. Results are based on the observed, rather than latent, variables. Correlations for the American sample are presented in the lower triangle; those for
the Chinese sample are presented in the upper triangle. Correlations with absolute values greater than .10 are significant (p < .05). Grades were standardized
within schools with means equal to zero and standard deviations equal to one; the other dimensions of academic functioning were not included given space
limitations, but information on them may be obtained by contacting the first author (see also the Results section).

The central analyses took such effects into account. These
analyses were conducted within a latent SEM framework using
Mplus 7.0 (Muthén & Muthén, 1998-2012), which employs full
information maximum likelihood (FIML) estimation in the presence
of missing data; FIML provides more reliable standard errors
in handling missing data under a wider range of conditions than
do list- and pairwise deletion and mean imputation
(Arbuckle, 1996; Wothke, 2000). To identify differences between
the United States and China, two-group nested model comparisons 
were employed: The unconstrained models were compared to 
more parsimonious models with constraints of equal coefficients 
imposed between the two countries on the effects of interest; for 
each set of models, the constraints were imposed one by one and 
then simultaneously. A significant difference (Δχ²) between an
unconstrained model and a more parsimonious constrained model 
indicates a country difference. The same two parcels or items used 
in the CFAs conducted to establish measurement invariance (see 
the Method section) were employed for the latent constructs in the 
model; for grades, the four subjects were each used as indicators of 
the latent construct. A separate set of models was conducted for 
each dimension of academic functioning. 
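
The nested model comparison rests on a chi-square difference test, which can be sketched as follows. The sketch uses only the Python standard library; the function names and the example chi-square values are hypothetical and are not taken from the study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability for a chi-square statistic with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:   # odd df: start from the closed form for df = 1
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:        # even df: start from the closed form for df = 2
        q, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    for _ in range((df - 1) // 2):
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_difference(chi2_constrained, df_constrained, chi2_free, df_free):
    """Return (delta chi-square, delta df, p) for two nested models."""
    delta = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    return delta, delta_df, chi2_sf(delta, delta_df)


# Hypothetical fit statistics: one equality constraint added across countries
delta, delta_df, p = chi_square_difference(1082.2, 320, 1080.0, 319)
print(round(delta, 2), delta_df, p > 0.05)
```

A nonsignificant difference, as in this hypothetical case, would indicate that constraining the path to be equal in the two countries does not worsen fit.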

Prior research using this data set already established the total 
effects of parents’ involvement on children’s engagement and 
grades over time. C. S. Cheung and Pomerantz (2012) conducted 
sets of two-group nested SEM analyses examining if parents’ 
involvement is predictive of children’s academic functioning over 
the four waves (see also C. S. Cheung & Pomerantz, 2011): The 
effect of parents’ involvement at Wave 1 on children’s academic
functioning (i.e., engagement and grades) at Wave 4 was evalu- 
ated, taking into account residual variance by adjusting for chil- 
dren’s earlier (Wave 1) academic functioning as well as allowing 
the variance of parents’ involvement and children’s academic 
functioning at Wave 1 to correlate. The unconstrained, χ²s(df >
5) < 3.91, CFIs > .96, TLIs > .95, RMSEAs < .05, and
constrained, χ²s(dfs = 4) = 1.67, CFIs > .96, TLIs > .96,
RMSEAs = .04, models fit the data well, with the effects similar
in the United States and China, Δχ²s(df = 1) < 1.5. The more
involved parents were in children’s learning, the more children
were engaged (γ = .15 for self-regulated learning strategies and
.08 for time spent on schoolwork; ts > 2.66, ps < .01), and the
better their grades (γ = .07; t = 3.01, p < .01) 2 years later over
and above their earlier engagement and grades. 

In the current report, we used two-group nested SEM analyses 
to identify the role of the two value development pathways (i.e., 
the perception-acceptance and experience pathways) in explaining 
the effects of parents’ involvement on children’s academic func- 
tioning. As shown in Figure 2, children’s reports of parents’ 
involvement at Wave 1 were specified to predict children’s perceptions
of the value parents place on school achievement (i.e., the
first step of the perception-acceptance pathway, path a in Figures
1 and 2) and the value children themselves place on school (i.e.,
the first step of the experience pathway, path c in Figure 1 and c1
in Figure 2) at Wave 2. For the perception-acceptance pathway, 
children’s perceptions of parents’ values at Wave 2 were specified 
to predict their own values at Wave 3 (path b in Figures 1 and 2), 
which in turn were specified to predict children’s academic func- 
tioning at Wave 4 (path d in Figures 1 and 2). For the experience 
pathway, the value children place on school achievement at Wave 
2 was specified to predict the maintenance of such value at Wave 
3 (path c2 in Figure 2), which in turn was specified to predict their 
academic functioning at Wave 4 (path d in Figures 1 and 2). 
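
One common way to summarize a multi-step pathway of this kind is the product of its path coefficients. The sketch below applies that idea to the two pathways as specified above; it is not necessarily the authors' procedure, and the coefficient values are hypothetical placeholders, not estimates from the study.

```python
from math import prod


def indirect_effect(*path_coefficients):
    """Product-of-coefficients summary of a compound path."""
    return prod(path_coefficients)


# Hypothetical standardized coefficients, NOT estimates from the study
paths = {"a": 0.10, "b": 0.08, "c1": 0.07, "c2": 0.50, "d": 0.15}

# Perception-acceptance pathway: involvement -> perceived parental value
# -> child value -> academic functioning (paths a, b, d)
perception_acceptance = indirect_effect(paths["a"], paths["b"], paths["d"])

# Experience pathway: involvement -> child value -> maintained child value
# -> academic functioning (paths c1, c2, d)
experience = indirect_effect(paths["c1"], paths["c2"], paths["d"])

print(perception_acceptance, experience)
```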

The mediating roles of the two pathways were simultaneously 
evaluated to assess the unique effects of each. Residual variance 
for each of the downstream constructs was taken into account. 
Specifically, as shown in Figure 2, corresponding constructs as- 
sessed 6 months prior to each of the constructs specified in the 
pathways were included to take into account autoregressive ef- 
fects. Concurrent associations between constructs were also taken 
into account by allowing the variances (Wave 1) or error variances 
(Waves 2, 3, and 4) of the constructs to correlate within each wave.
Because children’s residence in a single-headed household as well
as their gender are associated with children’s achievement during 
middle school (e.g., Downey, 1994; Dwyer & Johnson, 1997; 
Entwisle, 1997), children’s reports of whether they reside with 
both parents in the same household (1 = not residing with both 
parents, 2 = residing with both parents) and their gender (1 = 
boys, 2 = girls) were included as covariates by specifying them to 
predict children’s grades at Wave 4. 

The unconstrained models (i.e., individual models for self- 
regulated learning strategies, time spent on schoolwork outside of 
school, and grades) fit the data adequately, χ²s(dfs > 319) = 1080,
CFIs > .94, TLIs > .92, RMSEAs < .08. Two-group nested
model comparisons indicated that the links comprising both the
perception-acceptance and experience pathways were similar in
the United States and China, Δχ²s(dfs = 1) < 2.2, ns; thus, all
such effects were constrained to be equal between the two countries
in the final constrained models, χ²s(dfs > 314) > 1069,
CFIs > .96, TLIs > .94, RMSEAs < .07. As shown in Table 2 and 
Figure 2, there was support for the perception-acceptance pathway. 
Children’s reports of parents’ involvement at Wave 1 predicted 
children’s perceptions of the value parents place on school 
achievement at Wave 2 taking into account children’s earlier 
(Wave 1) perceptions (ts = 2.96, ps < .01). In turn, the more 
children perceived parents as valuing school achievement at Wave 
2, the more they themselves valued it at Wave 3 taking into 
account the earlier (Wave 2) value children placed on school 
achievement (ts = 2.02, ps < .05). The value children placed on 
school achievement at Wave 3 predicted enhanced engagement 
(ts > 3.10, ps < .01) and grades (ts = 4.92, ps < .001) among 
children at Wave 4 over and above their earlier (Wave 1) engagement 
and grades. Notably, at Wave 3, neither parents’ involvement 
nor children’s perceptions of the value parents place on school 
achievement uniquely predicted children’s engagement or grades 
(ts < 1). Thus, although the value children placed on school 
achievement predicted their subsequent perceptions of the value 
parents place on it (see Figure 2), this was not a viable pathway 
by which parents’ involvement benefits children’s academic 
functioning. 

Summary of Model Fit and Parameter Estimates for the Value Development Models 
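
The logic of the two-group nested model comparisons reported above can be illustrated with a short Python sketch. It is not the authors' analysis code (the text mentions Mplus); the function name `chi2_sf_df1` is hypothetical, and the only value taken from the text is the reported upper bound of 2.2 on the chi-square differences, each with 1 degree of freedom.

```python
import math


def chi2_sf_df1(stat: float) -> float:
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom.

    For 1 df, P(X > stat) = erfc(sqrt(stat / 2)), so no external library is needed.
    """
    if stat < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(stat / 2.0))


# Upper bound on the chi-square differences reported in the text (dfs = 1).
delta_chi2_bound = 2.2
p_at_bound = chi2_sf_df1(delta_chi2_bound)

# Smaller statistics give larger p values, so every comparison with a
# difference below the bound has a p value above p_at_bound.
print(f"p at bound = {p_at_bound:.3f}")
print("equality constraint retained at alpha = .05:", p_at_bound > 0.05)
```

A difference below the 1-df critical value of about 3.84 is nonsignificant at the .05 level, which is consistent with the "ns" reported for the cross-country comparisons.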

There was also support for the experience pathway (see Figure 
2 and Table 2). Parents’ involvement as reported by children at 
Wave 1 predicted the value children themselves placed on school 
achievement at Wave 2 taking into account children’s earlier 
(Wave 1) value (ts = 2.08, ps < .05). The value children placed on 
school achievement was maintained over time—that is from Wave 
2 to 3 (ts = 6.28, ps < .01), which, as reported above, predicted 
children’s engagement and grades at Wave 4. 

The total effects of parents’ involvement (Wave 1) on children’s 
academic functioning (Wave 4) were no longer evident in either 
country (γs < .03) with the inclusion of the value development 
pathways, which resulted in a reduction of at least 65% of the total 
effect for each of the three dimensions of children’s academic 
functioning. Mplus’s delta method indicated that the two-step 
perception-acceptance pathway was significant in the United 
States and China in explaining the role of involvement in chil- 
dren’s engagement as reflected in their self-regulated learning 
strategies (zs > 2.21, ps < .05) and grades (zs > 2.08, ps < .05). 
For engagement, as reflected in children’s time spent on schoolwork 
outside of school, the perception-acceptance pathway 
was marginal (zs = 1.81, ps < .06). The one-step experience 
pathway was evident across all three dimensions of children’s 
academic functioning (zs > 3.38, ps < .01). These results are 
consistent with those yielded by analyses using bootstrap resam- 
pling techniques. For example, in the model focusing on grades as 
the final outcome, the estimate of the perception-acceptance path- 
way via perceptions of parental value and child value using 5,000 
bootstrap resamples was .004 (95% CI = .001, .009), and that of 
the experience pathway was .017 (95% CI = .001, .042). 
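
The delta-method and bootstrap tests of the indirect pathways described above can be illustrated with a minimal Python sketch. It uses simulated data and a single-mediator chain (x → m → y) rather than the multiwave models estimated in the article. All variable names, path values, the sample size, and the number of resamples (1,000 here, versus the 5,000 used in the article) are assumptions made for illustration. The standard error shown is the first-order (Sobel-type) delta-method form.

```python
import math
import random
from statistics import fmean


def ols_slope(x, y, n_params=2):
    """Slope of y on x (intercept included) and its standard error.

    n_params is the number of coefficients in the full model and sets
    the residual degrees of freedom.
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(rss / (n - n_params) / sxx)


def residualize(y, x):
    """Residuals of y after removing its linear dependence on x."""
    slope, _ = ols_slope(x, y)
    mx, my = fmean(x), fmean(y)
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]


def indirect_effect(x, m, y):
    """Return (a*b, delta-method z) for the path x -> m -> y."""
    a, se_a = ols_slope(x, m)
    # b: slope of y on m controlling for x (residual-on-residual regression).
    b, se_b = ols_slope(residualize(m, x), residualize(y, x), n_params=3)
    se_ab = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    return a * b, (a * b) / se_ab


def bootstrap_ci(x, m, y, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ab, _ = indirect_effect([x[i] for i in idx],
                                [m[i] for i in idx],
                                [y[i] for i in idx])
        estimates.append(ab)
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Simulated (hypothetical) data: x = involvement, m = value, y = grades.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.5 * mi + 0.1 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

ab, z = indirect_effect(x, m, y)
ci = bootstrap_ci(x, m, y)
print(f"indirect effect = {ab:.3f}, z = {z:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

An indirect effect is treated as reliable when the delta-method z exceeds the critical value or, in the bootstrap approach, when the confidence interval excludes zero, as it does for both intervals reported in the text.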

The model examined also allowed us to test the viability of 
alternative pathways—for example, the possibility that the value 
children place on school achievement leads them to report parents 
as more involved over time, which in turn leads children to view 
parents as more invested in their achievement, thereby enhancing 
children’s academic functioning. To examine these alternative 
explanations, we evaluated the role of all possible pathways in the 
link between parents’ involvement and children’s academic functioning, 
including the value development pathways, simultaneously 
in the same model. The unconstrained, χ²s(dfs > 360) = 1203, 
CFIs > .92, TLIs > .90, RMSEA = .08, and constrained, 
χ²s(dfs > 350) > 1191, CFIs > .92, TLIs > .90, RMSEAs < .08, 
models fit the data adequately. When simultaneously evaluated in 
the model with the two value development pathways, none of the 
alternative pathways (out of six possible pathways) was evident 
(zs < 1.10, ns). However, the two value development pathways 
remained significant (zs > 2.10, ps < .05), reflecting their unique- 
ness. Although none of the alternative pathways were evident, one 
link comprising one of them was: In both countries, children’s 
value at Wave 2 was predictive of their perceptions of parents’ 
value at Wave 3, adjusting for children’s earlier perceptions (γs = 
.25-.28, ts > 2.18, ps < .05). 


Discussion 


The current research is the first empirical test of one of the most 
frequently proposed pathways—that is, value development— 
argued to underlie the benefits of parents’ involvement in chil- 
dren’s learning (e.g., Epstein, 1988; Grolnick & Slowiaczek, 
1994). Consistent with the two-step value transmission model put 
forth by Grusec and Goodnow (1994), there was evidence for a 
perception-acceptance pathway: The more involved parents 
were—as reported by children—the more children perceived them 
as placing heightened value on school achievement; this, in turn, 
was predictive of children coming to value school achievement 
more over time. In line with ideas that parents’ involvement may 
create experiences that foster value development among children 
(e.g., Hill & Tyson, 2009), there was also evidence that parents’ 
involvement contributes directly to the value children place on 
school achievement (i.e., the experience pathway). Both pathways 
uniquely accounted for the beneficial effect of parents’ involve- 
ment on children’s later academic functioning (i.e., engagement 
and grades). 

The effects of the two value development pathways were robust 
in that they remained even when alternative pathways were taken 
into account (e.g., the more children value school achievement, the 
more they see parents as valuing it, which heightens children’s 
reports of parents’ involvement, thereby enhancing their achievement); 
the value development pathways were also not due to 
children’s gender or residence with both parents (vs. one parent), which 
have been linked to children’s achievement (e.g., Downey, 1994; 
Dwyer & Johnson, 1997; Entwisle, 1997). Although comparable to 
those of prior research using stringent statistical controls to iden- 
tify indirect pathways over time (e.g., Davies, Woitach, Winter, & 
Cummings, 2008; NICHD Early Child Care Research Network, 
2003), the effects of the value development pathway were mod- 
est—perhaps because value development has been underway for 
some time once children reach adolescence with only incremental 
change occurring at this time. Even modest effects, however, may 
be critical to offsetting the devaluing of school that often occurs 
among children over adolescence (for a review, see Wigfield & 
Wagner, 2005). Moreover, incremental change can be meaningful 
as it may accumulate over time (Pomerantz, Qin, Wang, & Chen, 
2011). Moderation may also contribute to the modest effects. For 
example, drawing from Grusec and Goodnow (1994), when chil- 
dren have poor relationships with parents, parents’ involvement in 
their learning may be less likely to lead them to take on parents’ 
values as their own. 

Increasingly, research has focused on understanding the processes 
underlying the benefits of parents’ involvement in children’s 
learning for children’s academic functioning (for a review, 
see Pomerantz et al., 2012). In this vein, children’s actual and 
perceived competencies have been identified as important mech- 
anisms (e.g., Dearing et al., 2006; Senechal & LeFevre, 2002). 
Although it is possible that such mechanisms are distinct from the 
value development pathways identified in the current research, it is 
also possible that they work together. Value development may 
establish the foundation for growth in children’s competencies: 
Once children come to see achievement in school as personally 
important, they may be more receptive to parents’ instruction, 
which may develop their competencies, thereby allowing them to 
feel confident. Other mechanisms may also be a part of the value 
development pathways. For example, C. S. Cheung and Pomerantz 
(2012) found that the effect of parents’ involvement on children’s 
achievement was due in part to children adopting parent-oriented 
reasons (e.g., to meet parents’ expectations) for school achieve- 
ment; such motivation may be particularly likely to develop once 
children see parents as valuing school achievement, ultimately 
leading children to view achievement as personally important so 
that they may do well to satisfy parents who have committed 
substantial resources to their learning. 

Despite differences in the quantity and quality of American and 
Chinese parents’ involvement in children’s learning (for reviews, 
see Chao & Tseng, 2002; Pomerantz, Ng, & Wang, 2008), the 
value development pathways were similarly evident in the United 
States and China. Although Chinese parents tend to accompany 
their involvement in children’s learning with control more than do 
American parents (e.g., C. S. Cheung & Pomerantz, 2011), the 
more parents were involved, the more children viewed them as 
valuing school achievement and valued it themselves in both the 
United States and China. Moreover, children’s perceptions of the 
value parents place on school achievement were similarly predic- 
tive over time of the value children themselves placed on it in the 
two countries. Thus, it appears that, regardless of its quantity or 
quality, parents’ involvement in children’s learning may be a 
unique dimension of parenting in that it conveys parents’ values 
while also having characteristics such as emotional support that 
may increase the accuracy of children’s perceptions of such values 
as well as their acceptance of them. 

The current research was guided by Grusec and Goodnow’s 
(1994) two-step process model by which parents transmit their 
values to children. However, it diverged from the model in that the 
actual value parents place on school achievement was not directly 
assessed, but rather assumed to be reflected in parents’ involve- 
ment in children’s learning. Although parents’ values likely drive 
their involvement, so do other forces—for example, children’s 
invitations to be involved, parents’ beliefs about their capacity to 
support children’s learning, and whether parents see it as their role 
to be involved (for a review, see Hoover-Dempsey & Sandler, 
1997). The current research did not examine the accuracy of value 
transmission, but rather what parents’ involvement conveyed to 
children about the value parents place on achievement in school. 
It is of note that the value parents place on children’s school 
achievement may not be conveyed if parents are not involved. 
Hence, simply valuing school achievement may not reap the same 
benefits as being involved. 


Limitations 


Several limitations should be considered in interpreting the 
results. Perhaps most significantly, with the exception of grades, 
children served as the sole reporters. To rule out informant bias, 
our model controlled for the concurrent associations between the 
child-reported constructs as well as the stability of each over time 
as both these links are likely to contain informant bias. Given such 
controls, the value development pathways are unlikely to contain 
informant bias. However, we went further in ruling out the possi- 
bility of other pathways that could result in bias due to children’s 
reports—for example, we ensured that the effects were not simply 
due to children’s values driving their reports of parents’ values and 
involvement. Despite the merits of using multiple informants, it 
was crucial that children report on both their perceptions of the 
value parents place on school as well as the value they themselves 
place on school given that these constructs represent children’s 
beliefs to which they likely have the best access. Yet, because 
children’s reports of parents’ involvement are only modestly associated 
with parents’ and teachers’ reports (e.g., Bakker et al., 
2007; Hill et al., 2004; Reynolds, 1992), it will be important for 
future research to examine the value development pathways using 
parents’ and teachers’ reports. 

The current research also did not distinguish between mothers’ 
and fathers’ involvement, asking children instead to report on 
involvement as practiced by parents as a single entity. It is quite 
possible that mothers and fathers are differentially involved, 
reflecting differences in their time and values. For example, Roest, 
Dubas, Gerris, and Engels (2009) reported only modest correspondence 
between Dutch mothers’ and fathers’ values in terms of such 
things as the importance of pursuing happiness and working hard. 
Research in the United States indicates that mothers’ and fathers’ 
involvement in children’s learning does not necessarily overlap; 
for example, mothers are often more likely than fathers to attend 
school events and assist children with homework (Nord & West, 
2001). Future research should examine whether mothers’ and fathers’ 
involvement differentially guides value development among children. 
Attention should also be given to the moderating role of the 
consistency between mothers and fathers in their values and in- 
volvement because when there is more agreement between parents 
in their values, children often have more accurate perceptions of 
parents’ values (Knafo & Schwartz, 2004). 

Given their homogeneity (e.g., the American sample was 
mainly of European descent and the Chinese sample was mainly 
of Han descent), the samples used in the current research do not 
represent the diversity of the United States and China. Thus, 
questions remain concerning within-culture variations in the 
role of value development in the effect of parents’ involvement 
in children’s learning. Within the United States, there is some 
evidence that how parents are involved varies demographically 
(e.g., Hill & Taylor, 2004; Snyder & Dillow, 2012). For exam- 
ple, the more educated parents are, the more they take part in 
events at children’s school (Snyder & Dillow, 2012). It is 
possible that different types of involvement convey different 
messages about the value parents place on school (e.g., those 
that children see as taking more time and energy indicate most that 
parents view school as important). Of additional concern is that urban 
areas such as Beijing in China have been increasingly exposed to 
Western values in the past few decades. Thus, it is possible that the 
Chinese children in the current research interpret parents’ involve- 
ment more similarly to American children than do Chinese chil- 
dren residing in rural areas. 


Conclusions 


Despite these limitations, the current research is of import in 
providing empirical support for a value development model of the 
effects of parents’ involvement in children’s learning: Such in- 
volvement appears to benefit children in part because it leads 
children to view school achievement as valuable, which heightens 
their engagement in school, ultimately enhancing their grades. Via 
a perception-acceptance pathway, when parents become involved 
in children’s learning, children perceive parents as placing height- 
ened value on achievement in school; such perceptions in turn 
foreshadow children viewing achievement in school as personally 
important. In an experience pathway, parents’ involvement fore- 
shadows children placing heightened value on school achievement 
presumably due to the experiences created by parents’ involve- 
ment (e.g., discussion about school allows children to generate 
reasons for its utility), which in turn predicts enhanced academic 
functioning among children. These value development pathways 
were similarly evident in the United States and China where the 
quantity and quality of parents’ involvement differ. 